Wednesday, July 10, 2019

[Data Regression Analysis] Predicting Wimbledon Tennis Match Results with Deep Learning

Original article: https://towardsdatascience.com/predicting-wimbledon-matches-using-neural-network-e2ee4d3dead2

Steps:
  1. Build the features
  2. Build the neural network (two hidden layers)
  3. Predict the results

Step 1: Extract the Features

  1. Ranking
  2. Match win percentage
  3. Head-to-head record on grass
  4. Match win percentage over the past 60 weeks
  5. Percentage of best-of-5-set matches won
Features: each feature is the difference between the two players' values, e.g. diff_rank = (player 0's rank) - (player 1's rank), where player 0 is defined as the player ranked ahead of player 1.
Label: the outcome column in the data table is our label, where outcome=0 means player 0 wins and outcome=1 means player 1 wins. Original data source: http://www.tennis-data.co.uk/alldata.php (2010-2018)
Split the prepared data into a training set and a test set.


import pandas as pd
import numpy as np
from keras import layers, regularizers, Input
from keras.models import Model, load_model
from keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
#from utils.create_features_utils import *
sns.set_style("darkgrid")
# Download the data: https://github.com/jugalm/predicting-wimbledon-matches/tree/master/data
df = pd.read_csv('predicting-wimbledon-matches-master/data/wimbledon_matches_with_feature.csv')
df = df.dropna()  # Drop rows with missing values
df['diff_rank'] = df['player_0_rank'] - df['player_1_rank']

df.head()
# Select the features to use
features_list = [
 'diff_rank',
 'diff_match_win_percent',
 'diff_games_win_percent',
 'diff_5_set_match_win_percent',
 'diff_close_sets_percent',
 'diff_match_win_percent_grass',
 'diff_games_win_percent_grass',
 'diff_5_set_match_win_percent_grass',
 'diff_close_sets_percent_grass',
 'diff_match_win_percent_52',
 'diff_games_win_percent_52',
 'diff_5_set_match_win_percent_52',
 'diff_close_sets_percent_52',
 'diff_match_win_percent_grass_60',
 'diff_games_win_percent_grass_60',
 'diff_5_set_match_win_percent_grass_60',
 'diff_close_sets_percent_grass_60',
 'diff_match_win_percent_hh',
 'diff_games_win_percent_hh',
 'diff_match_win_percent_grass_hh',
 'diff_games_win_percent_grass_hh']

target = df.outcome  # Labels
features = df[features_list]  # Features
# Split the data into a training set and a test set: Train (80%) and Test (20%)
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.20, random_state=1)
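
Because player 0 is by definition the higher-ranked player, the two classes are not balanced, which matters when interpreting the accuracy numbers in Step 3. A minimal sketch for checking the label distribution (not part of the original article):

# Check the label distribution; outcome==0 (the higher-ranked player wins) is expected to be the majority class
print(train_target.value_counts(normalize=True))
print(test_target.value_counts(normalize=True))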

Step 2: Build the Neural Network

x = Input(shape=(len(features.columns),))  # Input is a 1-D vector with shape=(n,), where n = number of features
y = layers.Dense(64, activation='relu')(x)
y = layers.Dropout(0.5)(y)
print(y)
y = layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01))(y)
y = layers.Dropout(0.5)(y)
print(y)
z = layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(y)
print(z)
model = Model(x, z)

Tensor("dropout_1/cond/Merge:0", shape=(?, 64), dtype=float32)
Tensor("dropout_2/cond/Merge:0", shape=(?, 32), dtype=float32)
Tensor("dense_3/Sigmoid:0", shape=(?, 1), dtype=float32)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 21)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                1408      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
=================================================================
Total params: 3,521
Trainable params: 3,521
Non-trainable params: 0
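
For readers more familiar with the Sequential API, the same two-hidden-layer architecture can also be written as follows; this is only an equivalent sketch of the model above, not part of the original code:

from keras.models import Sequential
# Equivalent model: Dense(64) -> Dropout -> Dense(32, L2) -> Dropout -> Dense(1, sigmoid, L2)
seq_model = Sequential([
    layers.Dense(64, activation='relu', input_shape=(len(features.columns),)),
    layers.Dropout(0.5),
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)),
])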

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Stop training once val_loss has not improved for 500 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=500)
# Keep only the model with the lowest val_loss seen so far
mc = ModelCheckpoint('data/best_model.h5', monitor='val_loss', mode='min', verbose=2, save_best_only=True)

history = model.fit(train_features, train_target, epochs=1000, verbose=0, batch_size=128, validation_split=0.2, callbacks=[es, mc])
saved_model = load_model('data/best_model.h5')  # Reload the best checkpoint


Epoch 00001: val_loss improved from inf to 1.31275, saving model to data/best_model.h5

Epoch 00002: val_loss did not improve from 1.31275

Epoch 00003: val_loss did not improve from 1.31275

Epoch 00004: val_loss did not improve from 1.31275

Epoch 00005: val_loss did not improve from 1.31275

Epoch 00006: val_loss did not improve from 1.31275

Epoch 00007: val_loss did not improve from 1.31275

Epoch 00008: val_loss did not improve from 1.31275

Epoch 00009: val_loss did not improve from 1.31275

Epoch 00010: val_loss did not improve from 1.31275

Epoch 00011: val_loss did not improve from 1.31275

Epoch 00012: val_loss did not improve from 1.31275

Epoch 00013: val_loss improved from 1.31275 to 1.24170, saving model to data/best_model.h5

Epoch 00014: val_loss improved from 1.24170 to 1.12599, saving model to data/best_model.h5

Epoch 00015: val_loss improved from 1.12599 to 1.04835, saving model to data/best_model.h5

Epoch 00016: val_loss improved from 1.04835 to 1.00271, saving model to data/best_model.h5
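
Besides the training curves below, the restored best model can also be scored directly on the held-out test set. A minimal sketch (loss and accuracy correspond to the compile settings above):

test_loss, test_acc = saved_model.evaluate(test_features, test_target, verbose=0)
print('Test loss:', test_loss, 'Test accuracy:', test_acc)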


# Define a function to visualize the training history
import matplotlib.pyplot as plt
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.yscale('log')
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

# Visualize the training history
show_train_history(history,'loss','val_loss')
show_train_history(history,'acc','val_acc')

Step 3: Analyze the Predictions


possibility = saved_model.predict(test_features)  # Predicted probability that player 1 wins
prediction = possibility > 0.5
prediction = pd.Series(prediction[:, 0])
prediction = prediction.astype('int')
results = pd.DataFrame({'prediction': prediction.values, 'label': test_target.values})
df.head()

columns = ['Round', 'player_0', 'player_1', 'outcome']
info = df.iloc[test_features.index][columns]
info.head()

      Round      player_0          player_1           outcome
774   1st Round  Ramos-Vinolas A.  Pospisil V.        0
435   2nd Round  Murray A.         Lu Y.H.            0
1032  1st Round  Fucsovics M.      Benneteau J.       1
683   2nd Round  Mayer L.          Granollers M.      0
804   2nd Round  Goffin D.         Roger-Vasselin E.  0

Note the meaning of possibility:
  1. possibility is the probability that the answer is 1
  2. (1 - possibility) is the probability that the answer is 0; for rows predicted as 0 (prediction==0) we replace possibility with (1 - possibility), so that possibility becomes a confidence index: the closer possibility is to 1, the more confident the prediction
Below we analyze this confidence level (possibility).
info['prediction'] = prediction.values
info['possibility'] = possibility
# Convert possibility into a confidence index: for rows predicted as 0, use (1 - possibility)
info.loc[info['prediction'] == 0, 'possibility'] = 1 - info.loc[info['prediction'] == 0, 'possibility']
info.head()
      Round      player_0          player_1           outcome  prediction  possibility
774   1st Round  Ramos-Vinolas A.  Pospisil V.        0        1           0.913388
435   2nd Round  Murray A.         Lu Y.H.            0        0           0.818075
1032  1st Round  Fucsovics M.      Benneteau J.       1        1           0.733076
683   2nd Round  Mayer L.          Granollers M.      0        0           0.740707
804   2nd Round  Goffin D.         Roger-Vasselin E.  0        0           0.772853

# Overall accuracy
info_pos = (info['outcome'] == info['prediction']).sum() / len(info['outcome'])
print('Accuracy:', info_pos)
# Accuracy for outcome==0
info_0 = info[info['outcome'] == 0]
info_0_pos = (info_0['outcome'] == info_0['prediction']).sum() / len(info_0['outcome'])
print('outcome==0 accuracy:', info_0_pos)

# Accuracy for outcome==1
info_1 = info[info['outcome'] == 1]
info_1_pos = (info_1['outcome'] == info_1['prediction']).sum() / len(info_1['outcome'])
print('outcome==1 accuracy:', info_1_pos)

Accuracy: 0.7454545454545455
outcome==0 accuracy: 0.9272727272727272
outcome==1 accuracy: 0.2

From the analysis above, this model is good at predicting outcome==0, but its accuracy for outcome==1 is even below the chance level.
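
To make this imbalance explicit, the test-set predictions can be summarized in a confusion matrix; a minimal sketch using scikit-learn (not in the original article):

from sklearn.metrics import confusion_matrix
# Rows are the true outcome (0/1), columns are the predicted outcome (0/1)
print(confusion_matrix(info['outcome'], info['prediction']))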


Next, we analyze whether a higher prediction confidence (possibility) corresponds to a higher accuracy.


# For samples with outcome==0
info_0=info[info['outcome']==0]
bins=np.arange(0.5,1.01,0.05)
possibility_group=pd.cut(info_0['possibility'],bins=bins)
df_group_0=info_0.groupby(possibility_group).mean()
df_group_num=info_0.groupby(possibility_group).count()
g_num=df_group_num.iloc[:,0]
df_group_0['# of samples']=g_num
df_group_0=df_group_0[['prediction','# of samples']]

# For samples with outcome==1
info_1=info[info['outcome']==1]
bins=np.arange(0.5,1.01,0.05)
possibility_group=pd.cut(info_1['possibility'],bins=bins)
df_group_1=info_1.groupby(possibility_group).mean()
df_group_num=info_1.groupby(possibility_group).count()
g_num=df_group_num.iloc[:,0]
df_group_1['# of samples']=g_num
df_group_1=df_group_1[['prediction','# of samples']]

From the two charts below we can see:
  1. For samples with label==0, the higher the confidence, the closer the mean prediction is to 0 (the (0.9, 0.95] bin has only one sample, so its estimate is less reliable), i.e. higher confidence corresponds to higher accuracy
  2. For samples with label==1, however, higher confidence does not push the mean prediction closer to 1; it actually decreases. Note: the x-axis is the confidence level; the higher it is, the more confident the model is in its prediction


# Samples with outcome==0
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_0['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_0['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);



# Samples with outcome==1
fig,axis=plt.subplots(2,1,figsize=(9,5),sharex=True)
df_group_1['prediction'].plot.bar(ax=axis[0])
axis[1].set_xlabel('possibility group')
axis[0].set_ylabel('prediction')
df_group_1['# of samples'].plot.bar(ax=axis[1])
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);

Predicting the 2019 Results

# Prepare the 2019 features
df_2019 = pd.read_csv('Wimbledon2019.csv', sep=';')  # 2019 match schedule and player information
df_2019_features = pd.read_csv('data/wimbledon_matches_with_feature_2019.csv')  # Features for the 2019 matches
# Predict the 2019 results: the closer the model output is to 0, the more likely player 0 wins; the closer it is to 1, the more likely player 1 wins
df_2019['probability'] = saved_model.predict(df_2019_features).flatten()
df_2019['prediction'] = df_2019.apply(lambda row: round(row['probability']), axis=1)
# If the predicted value < 0.5, player 0 is predicted to win, with win probability (1 - probability)
# If the predicted value > 0.5, player 1 is predicted to win, with win probability (probability)
df_2019["probability"] = np.where(df_2019["prediction"] == 0, 1 - df_2019["probability"], df_2019["probability"])
# Finally, express the prediction as the winning player's name
df_2019['prediction_winner'] = np.where(df_2019['prediction'] == 0, df_2019['player_0'], df_2019['player_1'])
del df_2019['prediction']
df_2019
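
To see which 2019 matches the model is most confident about, the table can be sorted by probability; a minimal sketch:

# Show the predictions the model is most confident about, highest win probability first
df_2019.sort_values('probability', ascending=False).head(10)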

