Tuesday, July 16, 2019

[Data Regression Analysis] Predicting the 2019 Wimbledon Results with an XGBoost Model

In a previous post I predicted Wimbledon results with a two-layer dense deep-learning network. This post looks at whether XGBoost can improve on that prediction accuracy.

Step 1: Extract the features

  1. Ranking
  2. Match win percentage
  3. Head-to-head record on grass
  4. Match win percentage over the past 60 weeks
  5. Percentage of best-of-5-set matches won
We will use the same training data as before. Every feature is encoded as the player-0-minus-player-1 difference, so a single signed number tells the model which player has the edge (see the sketch below).
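To make that encoding concrete, here is a minimal sketch of how a diff feature is built. The toy numbers and the per-player column names are invented for illustration; the actual feature engineering lives in the predicting-wimbledon-matches project (see the commented-out create_features_utils import below).

import pandas as pd

# Two made-up matches; the per-player columns are hypothetical stand-ins
# for what the real feature-engineering pipeline produces.
sample = pd.DataFrame({
    'player_0_rank': [3, 25],
    'player_1_rank': [10, 8],
    'player_0_match_win_percent': [0.78, 0.55],
    'player_1_match_win_percent': [0.70, 0.62],
})

# Each model feature is the player-0-minus-player-1 difference.
sample['diff_rank'] = sample['player_0_rank'] - sample['player_1_rank']
sample['diff_match_win_percent'] = (sample['player_0_match_win_percent']
                                    - sample['player_1_match_win_percent'])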

Step 2: Build the XGBoost model


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor
import seaborn as sns
import matplotlib.pyplot as plt
#from utils.create_features_utils import *
sns.set_style("darkgrid")

# Load the training data built in the previous post.
df = pd.read_csv('predicting-wimbledon-matches-master/data/wimbledon_matches_with_feature.csv')

# Drop rows with missing values.
df = df.dropna()

# Ranking difference between the two players.
df['diff_rank'] = df['player_0_rank'] - df['player_1_rank']

# Select the features to feed the model
features_list = [
 'diff_rank',
 'diff_match_win_percent',
 'diff_games_win_percent',
 'diff_5_set_match_win_percent',
 'diff_close_sets_percent',
 'diff_match_win_percent_grass',
 'diff_games_win_percent_grass',
 'diff_5_set_match_win_percent_grass',
 'diff_close_sets_percent_grass',
 'diff_match_win_percent_52',
 'diff_games_win_percent_52',
 'diff_5_set_match_win_percent_52',
 'diff_close_sets_percent_52',
 'diff_match_win_percent_grass_60',
 'diff_games_win_percent_grass_60',
 'diff_5_set_match_win_percent_grass_60',
 'diff_close_sets_percent_grass_60',
 'diff_match_win_percent_hh',
 'diff_games_win_percent_hh',
 'diff_match_win_percent_grass_hh',
 'diff_games_win_percent_grass_hh']

target = df.outcome  # Labels
features = df[features_list]  # Features
# Split the data into training (80%) and test (20%) sets.
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.20, random_state=1)

# XGBoost takes the 2D feature table (num_samples, num_features) directly,
# so no reshaping is needed. We treat the 0/1 outcome as a regression target
# so the raw output can double as a confidence score.
xgbc = XGBRegressor()
#xgbc = XGBClassifier()
xgbc.fit(train_features, train_target)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0,
       importance_type='gain', learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1)
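As an optional check, we can peek at which features the fitted booster leans on; XGBRegressor exposes a scikit-learn-style feature_importances_ attribute (gain-based here, per importance_type='gain' in the repr above):

# Rank the diff features by gain-based importance.
importances = pd.Series(xgbc.feature_importances_, index=features_list)
print(importances.sort_values(ascending=False).head(10))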


Step 3: Analyze the predictions

xgbc.score(train_features, train_target)
0.5595579716862888
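Note that for a regressor, score returns the R² of the raw outputs, not classification accuracy. To get a training accuracy comparable to the test numbers below, apply the same 0.5 threshold; a short sketch:

# Training accuracy under the same 0.5 threshold used for the test set.
train_pred = (xgbc.predict(train_features) > 0.5).astype(int)
print('training accuracy:', (train_pred == train_target.values).mean())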

possibility = xgbc.predict(test_features)
prediction = possibility > 0.5
prediction = pd.Series(prediction)
prediction = prediction.astype('int')
results = pd.DataFrame({'prediction': prediction.values, 'label': test_target.values})
columns = ['Round', 'player_0', 'player_1', 'outcome']
info = df.loc[test_features.index][columns]  # .loc, not .iloc: the split keeps the original index labels
info.head()
Note what possibility means here:
  1. possibility is the model's estimate that the answer is 1.
  2. (1 - possibility) is the estimate that the answer is 0. For rows predicted 0 (prediction==0) we replace possibility with (1 - possibility), so possibility becomes a confidence index: the closer it is to 1, the more confident the prediction.
Below we analyze this confidence score (possibility).
info['prediction'] = prediction.values
info['possibility'] = possibility
# Flip the score for predicted-0 rows so possibility always means "confidence in the prediction".
mask = info['prediction'] == 0
info.loc[mask, 'possibility'] = 1 - info.loc[mask, 'possibility']
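The same flip can also be written as a single vectorized np.where, which is how the 2019 section below does it:

# Equivalent one-liner: flip the score only where the prediction is 0.
info['possibility'] = np.where(info['prediction'] == 0, 1 - info['possibility'], info['possibility'])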
# Overall accuracy
info_pos = (info['outcome'] == info['prediction']).sum() / len(info['outcome'])
print('accuracy:', info_pos)
# Accuracy when outcome==0
info_0 = info[info['outcome'] == 0]
info_0_pos = (info_0['outcome'] == info_0['prediction']).sum() / len(info_0['outcome'])
print('outcome==0 accuracy:', info_0_pos)
# Accuracy when outcome==1
info_1 = info[info['outcome'] == 1]
info_1_pos = (info_1['outcome'] == info_1['prediction']).sum() / len(info_1['outcome'])
print('outcome==1 accuracy:', info_1_pos)
accuracy: 0.740909090909091
outcome==0 accuracy: 0.8848484848484849
outcome==1 accuracy: 0.3090909090909091

The analysis shows the model is good at predicting outcome==0, but its accuracy on outcome==1 falls below the chance rate. In other words, what the model has really learned is that guessing outcome==0 is correct more often.
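A quick way to probe that hypothesis (not done in the original analysis) is to look at the label balance of the training set; if 0s dominate, always guessing 0 is already a strong baseline. XGBoost's standard knob for imbalance is scale_pos_weight; the sketch below uses the classifier variant since that parameter targets binary objectives, and whether it actually lifts outcome==1 accuracy here is untested:

# How skewed are the labels?
print(train_target.value_counts(normalize=True))

# Up-weight the positive class by the neg/pos ratio (the usual XGBoost recipe).
neg, pos = (train_target == 0).sum(), (train_target == 1).sum()
xgbc_balanced = XGBClassifier(scale_pos_weight=neg / pos)
xgbc_balanced.fit(train_features, train_target)
print('balanced test accuracy:', (xgbc_balanced.predict(test_features) == test_target.values).mean())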

For comparison, the earlier CNN+RNN model's Wimbledon predictions were:
accuracy: 0.7454545454545455
outcome==0 accuracy: 0.9272727272727272
outcome==1 accuracy: 0.2

XGBoost's accuracy turns out to be roughly on par with the deep-learning results.

Does higher prediction confidence (possibility) mean a higher chance of being correct?

# Samples with outcome==0
info_0 = info[info['outcome'] == 0]
bins = np.arange(0.5, 1.01, 0.05)
possibility_group = pd.cut(info_0['possibility'], bins=bins)
df_group_0 = info_0.groupby(possibility_group).mean()
df_group_num = info_0.groupby(possibility_group).count()
g_num = df_group_num.iloc[:, 0]
df_group_0['# of samples'] = g_num
df_group_0 = df_group_0[['prediction', '# of samples']]
# Samples with outcome==1
info_1 = info[info['outcome'] == 1]
bins = np.arange(0.5, 1.01, 0.05)
possibility_group = pd.cut(info_1['possibility'], bins=bins)
df_group_1 = info_1.groupby(possibility_group).mean()
df_group_num = info_1.groupby(possibility_group).count()
g_num = df_group_num.iloc[:, 0]
df_group_1['# of samples'] = g_num
df_group_1 = df_group_1[['prediction', '# of samples']]
The two charts below show:
  1. For label==0 samples, the higher the confidence, the closer the mean prediction is to 0 (the (0.9, 0.95] bin holds only one sample, so its bar is unreliable). Here, higher confidence does track higher accuracy.
  2. For label==1 samples, however, higher confidence does not pull the mean prediction toward 1; it actually drifts lower. (The x-axis is the confidence level; higher means a more confident prediction.) This again suggests the model learned that prediction==0 incurs the smaller loss, so it leans toward guessing 0 no matter what.
# Samples with outcome==0
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_0['prediction'].plot.bar(ax=axis[0])
axis[0].set_ylabel('prediction')
df_group_0['# of samples'].plot.bar(ax=axis[1])
axis[1].set_xlabel('possibility group')
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);
# Samples with outcome==1
fig, axis = plt.subplots(2, 1, figsize=(9, 5), sharex=True)
df_group_1['prediction'].plot.bar(ax=axis[0])
axis[0].set_ylabel('prediction')
df_group_1['# of samples'].plot.bar(ax=axis[1])
axis[1].set_xlabel('possibility group')
axis[1].set_ylabel('# of samples')
plt.xticks(rotation=90);

Predicting the 2019 results

# Prepare the 2019 features
df_2019 = pd.read_csv('Wimbledon2019.csv', sep=';')  # 2019 match schedule and player info
df_2019_features = pd.read_csv('data/wimbledon_matches_with_feature_2019.csv')  # Features for the 2019 matches

# Run the prediction
xgbc_predicY = xgbc.predict(df_2019_features)

# The closer the model output is to 0, the more likely player 0 wins; the closer to 1, the more likely player 1 wins.
df_2019['probability'] = xgbc.predict(df_2019_features).flatten()
df_2019['prediction'] = df_2019.apply(lambda row: round(row['probability']), axis=1)

# If prediction==0, player 0 is the predicted winner, so flip probability to (1 - probability);
# otherwise keep it as player 1's win probability.
df_2019["probability"] = np.where(df_2019["prediction"]==0, 1-df_2019["probability"], df_2019["probability"])
# Finally, express the prediction as the winner's name.
df_2019['prediction_winner'] = np.where(df_2019['prediction']==0, df_2019['player_0'], df_2019['player_1'])
del df_2019['prediction']
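To eyeball the picks, we can list each match with its predicted winner and the model's confidence, most confident first (a quick sketch using the columns built above):

# Most confident picks first.
print(df_2019[['player_0', 'player_1', 'prediction_winner', 'probability']]
      .sort_values('probability', ascending=False).head())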
