Wednesday, July 10, 2019

[Data Regression Analysis] Predicting Wimbledon Tennis Match Results with Deep Learning

Original article: https://towardsdatascience.com/predicting-wimbledon-matches-using-neural-network-e2ee4d3dead2

Steps:
  1. Build the features
  2. Build the neural network (two hidden layers)
  3. Predict the results

Step 1: Extract the Features

  1. Ranking
  2. Match win percentage
  3. Head-to-head record on grass
  4. Match win percentage over the past 60 weeks
  5. Percentage of best-of-5-set matches won
Features: each feature is the difference between the two players' values, e.g. diff_rank = (player 0's rank) - (player 1's rank), where player 0 is defined as the player ranked ahead of player 1.
Label: the outcome column in the data table is our label, where outcome=0 means player 0 wins and outcome=1 means player 1 wins. Original data source: http://www.tennis-data.co.uk/alldata.php (2010-2018)
Split the prepared data into a training set and a test set.


import pandas as pd
import numpy as np
from keras import layers, regularizers, Input
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
#from utils.create_features_utils import *
sns.set_style("darkgrid")
# Download the data: https://github.com/jugalm/predicting-wimbledon-matches/tree/master/data
df = pd.read_csv('predicting-wimbledon-matches-master/data/wimbledon_matches_with_feature.csv')
df = df.dropna()  # Drop rows with missing values
df['diff_rank'] = df['player_0_rank'] - df['player_1_rank']

df.head()
# Select the features to use
features_list = [
 'diff_rank',
 'diff_match_win_percent',
 'diff_games_win_percent',
 'diff_5_set_match_win_percent',
 'diff_close_sets_percent',
 'diff_match_win_percent_grass',
 'diff_games_win_percent_grass',
 'diff_5_set_match_win_percent_grass',
 'diff_close_sets_percent_grass',
 'diff_match_win_percent_52',
 'diff_games_win_percent_52',
 'diff_5_set_match_win_percent_52',
 'diff_close_sets_percent_52',
 'diff_match_win_percent_grass_60',
 'diff_games_win_percent_grass_60',
 'diff_5_set_match_win_percent_grass_60',
 'diff_close_sets_percent_grass_60',
 'diff_match_win_percent_hh',
 'diff_games_win_percent_hh',
 'diff_match_win_percent_grass_hh',
 'diff_games_win_percent_grass_hh']

target = df.outcome  # Labels
features = df[features_list]  # Features
# Split the data into a training set and a test set: Train (80%) and Test (20%)
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.20, random_state=1)
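
Because player 0 is by definition the higher-ranked player, the two classes are not balanced, which matters when interpreting the accuracy numbers in Step 3. A minimal sketch for checking the label distribution (not part of the original article):

# Check the label distribution; outcome==0 (the higher-ranked player wins) is expected to be the majority class
print(train_target.value_counts(normalize=True))
print(test_target.value_counts(normalize=True))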

Step 2: Build the Neural Network

x = Input(shape=(len(features.columns),))  # Input is a 1-D vector with shape=(n,), where n = number of features
y = layers.Dense(64, activation='relu')(x)
y = layers.Dropout(0.5)(y)
print(y)
y = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01))(y)
y = layers.Dropout(0.5)(y)
print(y)
z = layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(y)
print(z)
model = Model(x, z)

Tensor("dropout_1/cond/Merge:0", shape=(?, 64), dtype=float32)
Tensor("dropout_2/cond/Merge:0", shape=(?, 32), dtype=float32)
Tensor("dense_3/Sigmoid:0", shape=(?, 1), dtype=float32)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 21)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                1408      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 3,521
Trainable params: 3,521
Non-trainable params: 0
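
For readers more familiar with the Sequential API, the same two-hidden-layer architecture can also be written as follows; this is only an equivalent sketch of the model above, not part of the original code:

from keras.models import Sequential
# Equivalent model: Dense(64) -> Dropout -> Dense(32, L2) -> Dropout -> Dense(1, sigmoid, L2)
seq_model = Sequential([
    layers.Dense(64, activation='relu', input_shape=(len(features.columns),)),
    layers.Dropout(0.5),
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)),
])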

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Stop training once val_loss has not improved for 500 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=500)
# Keep only the model with the lowest val_loss seen so far
mc = ModelCheckpoint('data/best_model.h5', monitor='val_loss', mode='min', verbose=2, save_best_only=True)

history = model.fit(train_features, train_target, epochs=1000, verbose=0, batch_size=128, validation_split=0.2, callbacks=[es, mc])
saved_model = load_model('data/best_model.h5')  # Reload the best checkpoint


Epoch 00001: val_loss improved from inf to 1.31275, saving model to data/best_model.h5

Epoch 00002: val_loss did not improve from 1.31275

Epoch 00003: val_loss did not improve from 1.31275

Epoch 00004: val_loss did not improve from 1.31275

Epoch 00005: val_loss did not improve from 1.31275

Epoch 00006: val_loss did not improve from 1.31275

Epoch 00007: val_loss did not improve from 1.31275

Epoch 00008: val_loss did not improve from 1.31275

Epoch 00009: val_loss did not improve from 1.31275

Epoch 00010: val_loss did not improve from 1.31275

Epoch 00011: val_loss did not improve from 1.31275

Epoch 00012: val_loss did not improve from 1.31275

Epoch 00013: val_loss improved from 1.31275 to 1.24170, saving model to data/best_model.h5

Epoch 00014: val_loss improved from 1.24170 to 1.12599, saving model to data/best_model.h5

Epoch 00015: val_loss improved from 1.12599 to 1.04835, saving model to data/best_model.h5

Epoch 00016: val_loss improved from 1.04835 to 1.00271, saving model to data/best_model.h5
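
Besides the training curves below, the restored best model can also be scored directly on the held-out test set. A minimal sketch (loss and accuracy correspond to the compile settings above):

test_loss, test_acc = saved_model.evaluate(test_features, test_target, verbose=0)
print('Test loss:', test_loss, 'Test accuracy:', test_acc)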


# Define a function to visualize the training history
import matplotlib.pyplot as plt
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.yscale('log')
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

# Visualize the training history
show_train_history(history,'loss','val_loss')
show_train_history(history,'acc','val_acc')

Step 3: Analyze the Predictions


possibility = saved_model.predict(test_features)  # Predicted probability that player 1 wins
prediction = possibility > 0.5
prediction = pd.Series(prediction[:, 0])
prediction = prediction.astype('int')
results = pd.DataFrame({'prediction': prediction.values, 'label': test_target.values})
df.head()

columns = ['Round', 'player_0', 'player_1', 'outcome']
info = df.iloc[test_features.index][columns]
info.head()

      Round      player_0          player_1           outcome
774   1st Round  Ramos-Vinolas A.  Pospisil V.        0
435   2nd Round  Murray A.         Lu Y.H.            0
1032  1st Round  Fucsovics M.      Benneteau J.       1
683   2nd Round  Mayer L.          Granollers M.      0
804   2nd Round  Goffin D.         Roger-Vasselin E.  0

Note the meaning of possibility:
  1. possibility is the probability that the answer is 1
  2. (1 - possibility) is the probability that the answer is 0; for rows predicted as 0 (prediction==0) we replace possibility with (1 - possibility), so that possibility becomes a confidence index: the closer possibility is to 1, the more confident the prediction
Below we analyze this confidence level (possibility).
info['prediction'] = prediction.values
info['possibility'] = possibility
# Convert possibility into a confidence index: for rows predicted as 0, use (1 - possibility)
info.loc[info['prediction'] == 0, 'possibility'] = 1 - info.loc[info['prediction'] == 0, 'possibility']
info.head()
      Round      player_0          player_1           outcome  prediction  possibility
774   1st Round  Ramos-Vinolas A.  Pospisil V.        0        1           0.913388
435   2nd Round  Murray A.         Lu Y.H.            0        0           0.818075
1032  1st Round  Fucsovics M.      Benneteau J.       1        1           0.733076
683   2nd Round  Mayer L.          Granollers M.      0        0           0.740707
804   2nd Round  Goffin D.         Roger-Vasselin E.  0        0           0.772853

# Overall accuracy
info_pos = (info['outcome'] == info['prediction']).sum() / len(info['outcome'])
print('Accuracy:', info_pos)
# Accuracy for outcome==0
info_0 = info[info['outcome'] == 0]
info_0_pos = (info_0['outcome'] == info_0['prediction']).sum() / len(info_0['outcome'])
print('outcome==0 accuracy:', info_0_pos)

# Accuracy for outcome==1
info_1 = info[info['outcome'] == 1]
info_1_pos = (info_1['outcome'] == info_1['prediction']).sum() / len(info_1['outcome'])
print('outcome==1 accuracy:', info_1_pos)

Accuracy: 0.7454545454545455
outcome==0 accuracy: 0.9272727272727272
outcome==1 accuracy: 0.2

From the analysis above, this model is good at predicting outcome==0, but its accuracy for outcome==1 is even below the chance level.
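
To make this imbalance explicit, the test-set predictions can be summarized in a confusion matrix; a minimal sketch using scikit-learn (not in the original article):

from sklearn.metrics import confusion_matrix
# Rows are the true outcome (0/1), columns are the predicted outcome (0/1)
print(confusion_matrix(info['outcome'], info['prediction']))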


Next, we analyze whether a higher prediction confidence (possibility) corresponds to a higher accuracy.


# For samples with outcome==0
info_0=info[info['outcome']==0]
bins=np.arange(0.5,1.01,0.05)
possibility_group=pd.cut(info_0['possibility'],bins=bins)
df_group_0=info_0.groupby(possibility_group).mean()
df_group_num=info_0.groupby(possibility_group).count()
g_num=df_group_num.iloc[:,0]
df_group_0['# of samples']=g_num
df_group_0=df_group_0[['prediction','# of samples']]

# For samples with outcome==1
info_1=info[info['outcome']==1]
bins=np.arange(0.5,1.01,0.05)
possibility_group=pd.cut(info_1['possibility'],bins=bins)
df_group_1=info_1.groupby(possibility_group).mean()
df_group_num=info_1.groupby(possibility_group).count()
g_num=df_group_num.iloc[:,0]
df_group_1['# of samples']=g_num
df_group_1=df_group_1[['prediction','# of samples']]

From the two charts below we can see:
  1. For samples with label==0, the higher the confidence, the closer the mean prediction is to 0 (the (0.9, 0.95] bin has only one sample, so its estimate is less reliable), i.e. higher confidence corresponds to higher accuracy
  2. For samples with label==1, however, higher confidence does not push the mean prediction closer to 1; it actually decreases. Note: the x-axis is the confidence level; the higher it is, the more confident the model is in its prediction


# Samples with outcome==0
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_0['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_0['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);



# Samples with outcome==1
fig,axis=plt.subplots(2,1,figsize=(9,5),sharex=True)
df_group_1['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_1['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);

Predicting the 2019 Results

# Prepare the 2019 features
df_2019 = pd.read_csv('Wimbledon2019.csv', sep=';')  # 2019 match schedule and player information
df_2019_features = pd.read_csv('data/wimbledon_matches_with_feature_2019.csv')  # Features for the 2019 matches
# Predict the 2019 results: the closer the model output is to 0, the more likely player 0 wins; the closer it is to 1, the more likely player 1 wins
df_2019['probability'] = saved_model.predict(df_2019_features).flatten()
df_2019['prediction'] = df_2019.apply(lambda row: round(row['probability']), axis=1)
# If the predicted value < 0.5, player 0 is predicted to win, with win probability (1 - probability)
# If the predicted value > 0.5, player 1 is predicted to win, with win probability (probability)
df_2019["probability"] = np.where(df_2019["prediction"] == 0, 1 - df_2019["probability"], df_2019["probability"])
# Finally, express the prediction as the winning player's name
df_2019['prediction_winner'] = np.where(df_2019['prediction'] == 0, df_2019['player_0'], df_2019['player_1'])
del df_2019['prediction']
df_2019
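
To see which 2019 matches the model is most confident about, the table can be sorted by probability; a minimal sketch:

# Show the predictions the model is most confident about, highest win probability first
df_2019.sort_values('probability', ascending=False).head(10)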

