Steps:
- Build the features
- Build the neural network (two hidden layers)
- Predict the results
Step 1: Extract features
- Ranking
- Match win percentage
- Head-to-head record on grass
- Match win percentage over the past 60 weeks
- Percentage of best-of-5-set matches won
Features: each feature is the difference between the two players' values, e.g. diff_rank = (player 0's rank) - (player 1's rank), where player 0 is by convention the higher-ranked player.
Label: the outcome column in the data table is our label, where outcome = 0 means player 0 wins and outcome = 1 means player 1 wins.
Raw data source: http://www.tennis-data.co.uk/alldata.php (2010-2018)
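As a minimal, self-contained sketch of this convention (with made-up numbers; the real dataset already ships with these columns):

# Minimal sketch of the diff-feature convention, using made-up numbers
import pandas as pd

match = pd.DataFrame({
    'player_0_rank': [3],    # player 0 is the higher-ranked player by convention
    'player_1_rank': [15],
    'outcome': [0],          # 0 = player 0 won, 1 = player 1 won
})
# Negative diff_rank means player 0 is ranked ahead of player 1
match['diff_rank'] = match['player_0_rank'] - match['player_1_rank']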
Split the prepared data into a training set and a test set.
import pandas as pd
import numpy as np
from keras import layers, regularizers, Input
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")
# Download the data: https://github.com/jugalm/predicting-wimbledon-matches/tree/master/data
df = pd.read_csv('predicting-wimbledon-matches-master/data/wimbledon_matches_with_feature.csv')
df = df.dropna()
df['diff_rank'] = df['player_0_rank'] - df['player_1_rank']
df.head()
# Select the features to use
features_list = [
    'diff_rank',
    'diff_match_win_percent',
    'diff_games_win_percent',
    'diff_5_set_match_win_percent',
    'diff_close_sets_percent',
    'diff_match_win_percent_grass',
    'diff_games_win_percent_grass',
    'diff_5_set_match_win_percent_grass',
    'diff_close_sets_percent_grass',
    'diff_match_win_percent_52',
    'diff_games_win_percent_52',
    'diff_5_set_match_win_percent_52',
    'diff_close_sets_percent_52',
    'diff_match_win_percent_grass_60',
    'diff_games_win_percent_grass_60',
    'diff_5_set_match_win_percent_grass_60',
    'diff_close_sets_percent_grass_60',
    'diff_match_win_percent_hh',
    'diff_games_win_percent_hh',
    'diff_match_win_percent_grass_hh',
    'diff_games_win_percent_grass_hh',
]
target = df.outcome            # labels
features = df[features_list]   # features
# Split the data into a training set (80%) and a test set (20%)
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.20, random_state=1)
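An optional variant, not what the run above used: since the two classes turn out to be imbalanced (see the accuracy analysis below), a stratified split would keep the 0/1 label ratio consistent across the two sets:

# Hypothetical alternative: stratify on the label to preserve the class ratio
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.20, random_state=1, stratify=target)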
Step 2: Build the neural network
x = Input(shape=(len(features.columns),))  # the input is a 1-D vector of shape (n,), n = number of features
y = layers.Dense(64, activation='relu')(x)
y = layers.Dropout(0.5)(y)
print(y)
y = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01))(y)
y = layers.Dropout(0.5)(y)
print(y)
z = layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(y)
print(z)
model = Model(x, z)
Tensor("dropout_1/cond/Merge:0", shape=(?, 64), dtype=float32)
Tensor("dropout_2/cond/Merge:0", shape=(?, 32), dtype=float32)
Tensor("dense_3/Sigmoid:0", shape=(?, 1), dtype=float32)
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 21)                0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                1408
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33
=================================================================
Total params: 3,521
Trainable params: 3,521
Non-trainable params: 0

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
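As a sanity check on the summary, each Dense layer has (inputs + 1) × units parameters, the +1 being the bias: (21 + 1) × 64 = 1408, (64 + 1) × 32 = 2080, and (32 + 1) × 1 = 33, for 3,521 in total; the Dropout layers add none.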
# Early stopping watches the validation loss; the checkpoint keeps only the best weights on disk
es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=500)
mc = ModelCheckpoint('data/best_model.h5', monitor='val_loss', mode='min', verbose=2, save_best_only=True)
history = model.fit(train_features, train_target, epochs=1000, verbose=0,
                    batch_size=128, validation_split=0.2, callbacks=[es, mc])
saved_model = load_model('data/best_model.h5')  # reload the checkpointed best model
Epoch 00001: val_loss improved from inf to 1.31275, saving model to data/best_model.h5
Epoch 00002: val_loss did not improve from 1.31275
Epoch 00003: val_loss did not improve from 1.31275
Epoch 00004: val_loss did not improve from 1.31275
Epoch 00005: val_loss did not improve from 1.31275
Epoch 00006: val_loss did not improve from 1.31275
Epoch 00007: val_loss did not improve from 1.31275
Epoch 00008: val_loss did not improve from 1.31275
Epoch 00009: val_loss did not improve from 1.31275
Epoch 00010: val_loss did not improve from 1.31275
Epoch 00011: val_loss did not improve from 1.31275
Epoch 00012: val_loss did not improve from 1.31275
Epoch 00013: val_loss improved from 1.31275 to 1.24170, saving model to data/best_model.h5
Epoch 00014: val_loss improved from 1.24170 to 1.12599, saving model to data/best_model.h5
Epoch 00015: val_loss improved from 1.12599 to 1.04835, saving model to data/best_model.h5
Epoch 00016: val_loss improved from 1.04835 to 1.00271, saving model to data/best_model.h5
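Note that with patience=500 against only 1000 epochs, the EarlyStopping callback will rarely fire; it is save_best_only=True on the checkpoint that guarantees the reloaded model is the one with the lowest validation loss, even if later epochs overfit.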
# Define a function to visualize the training history
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.yscale('log')
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
# Visualize the training history
show_train_history(history, 'loss', 'val_loss')
show_train_history(history, 'acc', 'val_acc')
Step 3: Analyze the predictions
possibility = saved_model.predict(test_features)  # sigmoid outputs, shape (n, 1)
prediction = possibility > 0.5
prediction = pd.Series(prediction[:, 0])
prediction = prediction.astype('int')
results = pd.DataFrame({'prediction': prediction.values, 'label': test_target.values})
df.head()
columns = ['Round', 'player_0', 'player_1', 'outcome']
info = df.loc[test_features.index, columns]  # .loc, not .iloc: after dropna() the index labels are no longer positions
info.head()
|      | Round     | player_0         | player_1          | outcome |
|------|-----------|------------------|-------------------|---------|
| 774  | 1st Round | Ramos-Vinolas A. | Pospisil V.       | 0       |
| 435  | 2nd Round | Murray A.        | Lu Y.H.           | 0       |
| 1032 | 1st Round | Fucsovics M.     | Benneteau J.      | 1       |
| 683  | 2nd Round | Mayer L.         | Granollers M.     | 0       |
| 804  | 2nd Round | Goffin D.        | Roger-Vasselin E. | 0       |
Note what possibility means:
- possibility is the predicted probability that the answer is 1
- (1 - possibility) is the predicted probability that the answer is 0. For rows predicted as 0 (prediction == 0) we replace possibility with (1 - possibility), so possibility becomes a confidence index: the closer it is to 1, the more confident the prediction.

Below we analyze this confidence (possibility).
info['prediction'] = prediction.values
info['possibility'] = possibility.flatten()  # flatten the (n, 1) predict() output into a 1-D column
# Use .loc to avoid chained-indexing assignment, which may silently write to a copy
mask = info['prediction'] == 0
info.loc[mask, 'possibility'] = 1 - info.loc[mask, 'possibility']
info.head()
|      | Round     | player_0         | player_1          | outcome | prediction | possibility |
|------|-----------|------------------|-------------------|---------|------------|-------------|
| 774  | 1st Round | Ramos-Vinolas A. | Pospisil V.       | 0       | 1          | 0.913388    |
| 435  | 2nd Round | Murray A.        | Lu Y.H.           | 0       | 0          | 0.818075    |
| 1032 | 1st Round | Fucsovics M.     | Benneteau J.      | 1       | 1          | 0.733076    |
| 683  | 2nd Round | Mayer L.         | Granollers M.     | 0       | 0          | 0.740707    |
| 804  | 2nd Round | Goffin D.        | Roger-Vasselin E. | 0       | 0          | 0.772853    |
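Equivalently, the conversion to confidence can be written as one vectorized step, the same np.where pattern used for the 2019 predictions later on; a sketch that recomputes the column from the raw possibility array:

# One-step equivalent of the masked assignment above
info['possibility'] = np.where(info['prediction'] == 0,
                               1 - possibility.flatten(),
                               possibility.flatten())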
# Overall accuracy
info_pos = (info['outcome'] == info['prediction']).sum() / len(info['outcome'])
print('Accuracy:', info_pos)
# Accuracy on outcome==0
info_0 = info[info['outcome'] == 0]
info_0_pos = (info_0['outcome'] == info_0['prediction']).sum() / len(info_0['outcome'])
print('outcome==0 accuracy:', info_0_pos)
# Accuracy on outcome==1
info_1 = info[info['outcome'] == 1]
info_1_pos = (info_1['outcome'] == info_1['prediction']).sum() / len(info_1['outcome'])
print('outcome==1 accuracy:', info_1_pos)
Accuracy: 0.7454545454545455
outcome==0 accuracy: 0.9272727272727272
outcome==1 accuracy: 0.2
The analysis above shows that the model is good at predicting outcome==0, but its accuracy on outcome==1 is only 0.2, well below chance.
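A confusion matrix makes this imbalance visible at a glance; a quick sketch using sklearn on the info table built above:

from sklearn.metrics import confusion_matrix
# Rows are true outcomes (0, 1); columns are predicted outcomes (0, 1)
print(confusion_matrix(info['outcome'], info['prediction']))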
Next, we analyze whether a higher prediction confidence (possibility) corresponds to a higher chance of being correct.
# Samples with outcome==0
info_0 = info[info['outcome'] == 0]
bins = np.arange(0.5, 1.01, 0.05)
possibility_group = pd.cut(info_0['possibility'], bins=bins)
df_group_0 = info_0.groupby(possibility_group).mean()
df_group_num = info_0.groupby(possibility_group).count()
g_num = df_group_num.iloc[:, 0]
df_group_0['# of samples'] = g_num
df_group_0 = df_group_0[['prediction', '# of samples']]
# Samples with outcome==1
info_1 = info[info['outcome'] == 1]
bins = np.arange(0.5, 1.01, 0.05)
possibility_group = pd.cut(info_1['possibility'], bins=bins)
df_group_1 = info_1.groupby(possibility_group).mean()
df_group_num = info_1.groupby(possibility_group).count()
g_num = df_group_num.iloc[:, 0]
df_group_1['# of samples'] = g_num
df_group_1 = df_group_1[['prediction', '# of samples']]
The two charts below show:
- For samples with label==0, the higher the confidence, the closer the mean prediction is to 0 (though the (0.9, 0.95] bin has only one sample, so it is less reliable); that is, higher confidence does mean higher accuracy.
- For samples with label==1, however, higher confidence does not push the mean prediction toward 1; it actually shrinks.

Note: the x-axis is the confidence (possibility); higher means the model is more confident in its prediction.
# Samples with outcome==0
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_0['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_0['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);
# Samples with outcome==1
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_1['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_1['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);
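As a complementary check, per-bin accuracy can be computed directly; a short sketch reusing info and the same bins (the derived correct column is introduced here purely for illustration):

# Fraction of correct predictions within each confidence bin
info['correct'] = (info['outcome'] == info['prediction']).astype(int)
acc_by_bin = info.groupby(pd.cut(info['possibility'], bins=bins))['correct'].mean()
print(acc_by_bin)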
Predicting the 2019 results
# Prepare the 2019 features
df_2019 = pd.read_csv('Wimbledon2019.csv', sep=';')  # 2019 match schedule and player info
df_2019_features = pd.read_csv('data/wimbledon_matches_with_feature_2019.csv')  # feature data for the 2019 matches
# Predict the 2019 results: the closer the model's output is to 0, the more likely player 0 wins;
# the closer it is to 1, the more likely player 1 wins
df_2019['probability'] = saved_model.predict(df_2019_features).flatten()
df_2019['prediction'] = (df_2019['probability'] > 0.5).astype(int)
# If the predicted value < 0.5, player 0 is predicted to win, with win probability (1 - probability)
# If the predicted value > 0.5, player 1 is predicted to win, with win probability (probability)
df_2019["probability"] = np.where(df_2019["prediction"] == 0, 1 - df_2019["probability"], df_2019["probability"])
# Finally, express the prediction as the winner's name
df_2019['prediction_winner'] = np.where(df_2019['prediction'] == 0, df_2019['player_0'], df_2019['player_1'])
del df_2019['prediction']
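To inspect the output, something like the following shows each 2019 match with its predicted winner and estimated win probability (the column selection assumes player_0 and player_1 exist in Wimbledon2019.csv, as used above):

# Show each match with the predicted winner and the winner's estimated probability
df_2019[['player_0', 'player_1', 'prediction_winner', 'probability']].head()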