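Saturday, June 15, 2019

[Trading Strategy] A Case Study with TSMC ADR

Using TSMC ADR as an example, this post examines the returns from going long and short on TSMC with the monthly and quarterly moving averages as signals.

Traders often use the daily, monthly, and quarterly moving averages to judge whether a stock is in a bull or bear trend, and use moving-average indicators as the basis (or excuse) for buying and selling. This may be right in the short run (it may make money), but does trading for years with a strategy that feels clever actually beat the "dumb" approach of buying and holding? We run two experiments below: the first uses the monthly and quarterly lines as trading signals, and the second uses the daily and monthly lines.

Experiment 1: trade TSMC ADR as follows
Golden cross: monthly line > quarterly line

Death cross: monthly line < quarterly line

Trading strategy:

Monthly line - quarterly line > 0.5: go long (all call), putting all the money into the long position
Monthly line - quarterly line < -0.5: go short (all put), putting all the money into the short position
-0.5 < monthly line - quarterly line < 0.5: move everything into cash

Sample: TSMC ADR from 2007/1/1 to 2019/6/4. Control group: buy with all the money on 2007/1/1 and never sell, holding until 2019/6/4.

Compute the difference between the return of the trading strategy and that of the control group.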

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from yahoo_historical import Fetcher

# Fetch TSMC ADR prices from 2007/1/1 to 2019/6/14
data = Fetcher("TSM", [2007,1,1], [2019,6,14])

# Tidy up the table
df=pd.DataFrame(data.getHistorical())
df.head()
df=df.set_index(['Date'])
df['Close'].plot(grid=True,figsize=(8,5))

# Build the 21-day (one month) and 63-day (one quarter) moving averages
df['21d']=np.round(df['Close'].rolling(21).mean(),2)
df['63d']=np.round(df['Close'].rolling(63).mean(),2)
df[['Close','21d','63d']].plot(grid=True,figsize=(8,5))

# Spread between the 21-day and 63-day averages
df['21-63']=df['21d']-df['63d']
df['21-63'].tail()

# Trading strategy
# 1. monthly line - quarterly line > 0.5: go long (all call), all money long
# 2. monthly line - quarterly line < -0.5: go short (all put), all money short
# 3. -0.5 < monthly line - quarterly line < 0.5: sell all TSMC shares for cash
SD=0.5
df['Regime']=np.where(df['21-63']>SD,1,0)
df['Regime']=np.where(df['21-63']<-SD,-1,df['Regime'])
print("Days long (1), short (-1), in cash (0):\n",df['Regime'].value_counts())

df['Regime'].plot(lw=1.5)
plt.ylim([-1.1,1.1])

# Daily log return
df['market']=np.log(df['Close']/df['Close'].shift(1))  # log of (close on day T / close on day T-1)
df['strategy']=df['Regime'].shift(1)*df['market']  # apply yesterday's regime to today's return
df[['market','strategy']].cumsum().apply(np.exp).plot(grid=True,figsize=(8,5))  # long-run cumulative return

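Experiment 2: now let's check another common trading strategy.
Trading strategy:

Daily line above the quarterly line: go long (all call), putting all the money into the long position
Daily line below the quarterly line: go short (all put), putting all the money into the short position

Sample: TSMC ADR from 2007/1/1 to 2019/6/4. Control group: buy with all the money on 2007/1/1 and never sell, holding until 2019/6/4.

Compute the difference between the return of the trading strategy and that of the control group.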
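The post gives no code for Experiment 2, so here is a minimal sketch of how it could be set up. It assumes that the "daily line" is simply the daily close and the "quarterly line" is the 63-day moving average, and it runs on synthetic random-walk prices rather than the real TSM series; the function name `backtest_price_vs_ma` is hypothetical.

```python
import numpy as np
import pandas as pd

def backtest_price_vs_ma(close: pd.Series, window: int = 63) -> pd.DataFrame:
    """Long when the close is above its moving average, short when below."""
    out = pd.DataFrame({"Close": close})
    out["ma"] = out["Close"].rolling(window).mean()
    # +1 above the average, -1 below it, 0 while the average is still undefined
    out["Regime"] = np.sign(out["Close"] - out["ma"]).fillna(0.0)
    out["market"] = np.log(out["Close"] / out["Close"].shift(1))
    # Trade on yesterday's signal so today's return is not used to pick today's position
    out["strategy"] = out["Regime"].shift(1) * out["market"]
    return out

# Synthetic random-walk prices stand in for the real TSM series (assumption)
rng = np.random.default_rng(0)
close = pd.Series(40 * np.exp(np.cumsum(rng.normal(0.0003, 0.02, 1000))))

result = backtest_price_vs_ma(close)
cumulative = result[["market", "strategy"]].cumsum().apply(np.exp)
print(cumulative.tail(1))
```

To reproduce the experiment, the synthetic series would be replaced by `df['Close']` from Experiment 1. Like the original calculation, this sketch ignores transaction costs and dividends.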

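The Experiment 2 strategy performs far worse than both Experiment 1 and the control group (buy and hold)!

The results show that trading TSMC with the commonly accepted strategy (long when the monthly line is above the quarterly line, short when it is below) produces a long-run return far behind simply buying and leaving the position alone. The calculation does not even include transaction costs or dividends. Trading TSMC with these familiar rules therefore falls well short of buying and holding, so the best way to invest in TSMC appears to be buying at a low point and holding for the long term (the "dumb" investment method). If you cannot tell when the low point is, buy in regular fixed-amount installments (buy in batches, buy at lows, buy when you have money).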

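Thursday, June 13, 2019

[NLP] Generating Text with an LSTM

Steps:
1. Character-level neural language model:
Using a corpus (Nietzsche's writings), an LSTM layer takes strings of N characters as input and learns to predict the probability distribution of the (N+1)-th character. This yields a character-level neural language model.

2. Generate text one character at a time
Feed in a seed string of N characters, use the language model above to predict the distribution of the (N+1)-th character and draw a character from it, append that character to the end of the string, drop the first character, and feed the result back into the model, and so on.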
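The generation loop in step 2 can be tried without a trained network. The sketch below is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the LSTM that always returns the same distribution, and `reweight` and `generate` are names chosen here, not part of Keras.

```python
import numpy as np

def reweight(probs, temperature=1.0):
    """Rescale a probability distribution; a lower temperature sharpens it."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp_logits = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp_logits / exp_logits.sum()

def generate(predict_fn, seed, alphabet, n_chars, temperature=1.0, rng=None):
    """Slide a fixed-length window over the text, sampling one character at a time."""
    if rng is None:
        rng = np.random.default_rng()
    window, out = seed, []
    for _ in range(n_chars):
        probs = reweight(predict_fn(window), temperature)
        next_char = alphabet[rng.choice(len(alphabet), p=probs)]
        out.append(next_char)
        window = window[1:] + next_char  # drop the first character, append the new one
    return "".join(out)

# Toy stand-in for the trained model (assumption): a fixed distribution over "abc"
alphabet = "abc"
toy_predict = lambda window: np.array([0.6, 0.3, 0.1])

print(reweight([0.6, 0.3, 0.1], temperature=0.2))
print(generate(toy_predict, "abcabc", alphabet, 20, temperature=0.5,
               rng=np.random.default_rng(0)))
```

At temperature 1.0 the distribution is unchanged; lower temperatures concentrate probability on the most likely character, which is why low-temperature output tends to be repetitive.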

import keras
import numpy as np

path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

maxlen = 60  # length N of each input string
step = 3     # start a new input string every 3 characters
sentences = []   # input strings
next_chars = []  # the character that follows each input string

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])  # characters i to i+maxlen-1
    next_chars.append(text[i + maxlen])    # character i+maxlen
print('Number of sequences:', len(sentences))

# Build a dictionary that maps characters to integers
chars = sorted(list(set(text)))  # every distinct character in the corpus
print('Unique characters:', len(chars))
char_indices = dict((char, chars.index(char)) for char in chars)

# One-hot encode the training characters
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))  # softmax yields a probability distribution
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# Reweight the original distribution by the temperature, then draw one character index from it
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)  # one roll of a die whose faces have probabilities preds; returns the count for each character
    return np.argmax(probas)  # index of the character that was drawn

import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    model.fit(x, y, batch_size=128, epochs=1)
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)  # print the current input string

        # Generate the next 400 characters
        for i in range(400):
            # One-hot encode the current input string
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            # Run the model on sampled without progress output (verbose=0)
            preds = model.predict(sampled, verbose=0)[0]
            # Apply the temperature and draw the index of the next character
            next_index = sample(preds, temperature)
            # Look up the character for that index
            next_char = chars[next_index]
            # Append the character to the input string and drop the first character
            generated_text += next_char
            generated_text = generated_text[1:]
            sys.stdout.write(next_char)  # print the new character

# First training epoch

epoch 1
Epoch 1/1
200278/200278 [==============================] - 472s 2ms/step - loss: 1.9867
--- Generating with seed: "ie et sans esprit!

# Feed "in these later ages, which may be" into the trained model and generate text at different temperatures

229. in these later ages, which may be "
------ temperature: 0.2
ie et sans esprit!
229. in these later ages, which may be the such the still and the sure and and still the present the sure the man and the presenter the for the still and the sure of the from the still the man be the string the man becount and still the the the the sure the still that the soul and a more of the sure the still the stright the stright and the still the sure the sure and the moral the still the sure the sure and the still the sure and the

------ temperature: 0.5
l the still the sure the sure and the still the sure and the perhaps so this disto the philosopher that the string and will the super; and still the same with the desilse of the shill ana
suld conter the free solition of the than the sure for the sure desilse of presentate of the soulh and the for the histances to litely than this incression, and from the preale of contention, in the precestion, that the present to the discienteness than the is and are thi

------ temperature: 1.0
hat the present to the discienteness than the is and are thing, the musf lattire tercies somolian of remord hister
dolecy, with men ye of suses, is and a
corstare--thas and sole of the consciencly to yees as lose. the denchsion in the fantific fan and stone, and trung with their cincolne, and spoled asted to som suef
agage well in regeraving of real the spirits
mistodicato high returdisming
powhing.--as yre? of sulf the doven weally froe it wat give wo le

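Wednesday, June 12, 2019

[NLP] One-Hot Encoding of Words: Turning Text into Vectors a Machine Can Use

Vectorizing a document works as follows:

Split the document into small units (tokens)
Encode each token as a numeric vector through a dictionary lookup
Pack these vectors into a sequence tensor and feed it to the deep-learning network

There are two main ways to turn tokens into vectors:

One-hot encoding of tokens
Token embedding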

# One-hot encoding:

# 1-1. One-hot encoding of word tokens
import numpy as np

samples=['I want to be an machine learning expert.','I need to study a lot.']
token_index={}  # empty dictionary that will map each token to its index

# Build the token_index dictionary
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word]=len(token_index)+1  # index 0 is left unused

# Convert each string's tokens to vectors
max_length=10  # only the first 10 words of each sample are encoded
results=np.zeros(shape=(len(samples),max_length, max(token_index.values())+1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index=token_index.get(word)
        results[i,j,index]=1.

# 1-2. One-hot encoding of characters
import string

samples=['I want to be an machine learning expert.','I need to study a lot.']
characters=string.printable  # all printable ASCII characters
print(len(characters))
token_index=dict(zip(characters,range(1,len(characters)+1)))  # map each character to an index
max_length=50  # only encode the first 50 characters of each sample
results=np.zeros((len(samples),max_length,max(token_index.values())+1))
print(results.shape)
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index=token_index.get(character)
        results[i,j,index]=1.

# 1-3. One-hot encoding with Keras's built-in tools
from keras.preprocessing.text import Tokenizer

samples=['I want to be an machine learning expert.','I need to study a lot.']
tokenizer=Tokenizer(num_words=1000)  # keep only the 1,000 most frequent words
tokenizer.fit_on_texts(samples)

# Convert the texts to lists of integer indices
sequences=tokenizer.texts_to_sequences(samples)
print(sequences)

# Convert the texts to one-hot vectors
one_hot_results=tokenizer.texts_to_matrix(samples, mode='binary')  # binary: 1 if the word occurs, else 0
print(one_hot_results)

# Report the size of the word-to-index table
word_index=tokenizer.word_index
print('found %s unique tokens.'%len(word_index))

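Monday, June 10, 2019

[Stocks] Predicting the TSMC Stock Price Trend with an RNN in Keras

Steps:
1. Download TSMC stock prices
2. Normalize the data
3. Prepare the training set and test set:
Each training feature is an array of opening prices over 60 days; consecutive training samples are offset by one day
Each training label is the opening price on day 70
The test set consists of the prices after the last training sample
4. Build the RNN model with the training set
5. Visualize the training process
6. Evaluate the model on the test set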
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
from yahoo_historical import Fetcher

# Download prices for TSM, TSMC's ADR listed on the New York Stock Exchange
data = Fetcher("TSM", [2007,1,1], [2019,1,1])
df=pd.DataFrame(data.getHistorical())
#print(data.getHistorical())
df=df.set_index('Date')
df.head()

df['Open'].plot()

# Normalize the data to the range [0, 1]
training_set = df.iloc[:,0:1].values  # first column: opening price
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

# Prepare the training set: 60 opening prices as features, the opening price at index i+10 as the label
X_train = []
y_train = []
for i in range(60, 2035):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i+10, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape to (samples, timesteps, features) as the LSTM layer expects
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Prepare the test set from the rows after the training range
X_test = []
y_test = []
for i in range(2035, len(training_set_scaled)-10):
    X_test.append(training_set_scaled[i-60:i, 0])
    y_test.append(training_set_scaled[i+10, 0])
X_test, y_test = np.array(X_test), np.array(y_test)

X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))


# Import Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout


# Build the model: four stacked LSTM layers, each followed by dropout
regressor = Sequential()
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))
regressor.add(Dense(units = 1))
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics=['mae'])
train_history=regressor.fit(X_train, y_train, validation_split=0.1, epochs = 10, batch_size = 50)


# Visualize the training history
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train','validation'],loc='upper left')
    plt.show()
show_train_history(train_history,'loss','val_loss')


# Evaluate the model on the test set
score=regressor.evaluate(X_test,y_test)
print('mae=',score[1])


# Predict
prediction=regressor.predict(X_test)

# Transform the predictions back to the original price scale
prediction_t = sc.inverse_transform(prediction.reshape(-1,1))
y_test_t = sc.inverse_transform(y_test.reshape(-1,1))

# Plot the actual and predicted prices
plt.plot(np.arange(len(y_test)),y_test_t,label='real')
plt.plot(np.arange(len(prediction)),prediction_t,label='prediction')
plt.title('TSMC (TSM ADR) price: actual vs. predicted')
plt.legend()

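The predicted trend looks accurate, but these predictions still use prices from after the training period as input. The model also uses only the opening price. In practice, daily trading volume and institutional positions also affect future prices, so later models will add daily volume and institutional positions as features.

Next, we use all the columns returned by yahoo_historical (Open, High, Low, Close, Adj Close, Volume) as features and train the model for 100 epochs.

With six features, the predicted prices fit the actual prices worse than with the opening price alone.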

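Sunday, June 9, 2019

[NLP] Word Embeddings for Machine Sentiment Classification: IMDB Movie Reviews

Word embedding (token embedding):

One-hot encoding produces binary, sparse (mostly 0), very high-dimensional vectors. Token embedding instead produces low-dimensional floating-point vectors and is suitable for vocabularies of 20,000 words or more.

There are two ways to obtain word-embedding vectors:

Learn them with an Embedding layer
Use word-embedding vectors already trained by another machine-learning model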
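The contrast between the two encodings can be made concrete with a few lines of NumPy. This is only an illustrative sketch: the embedding matrix is random here, whereas an Embedding layer would learn it during training, and the sizes (10 words, 4 dimensions) are arbitrary assumptions.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
# In practice this matrix is learned; random values are used here for illustration
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([3, 1, 7])  # three tokens, already converted to integer indices

# One-hot: one sparse row of length vocab_size per token
one_hot = np.eye(vocab_size)[token_ids]
# Embedding: one dense row of length embedding_dim per token
embedded = embedding_matrix[token_ids]

# An embedding lookup is equivalent to multiplying the one-hot rows by the matrix
assert np.allclose(one_hot @ embedding_matrix, embedded)
print(one_hot.shape, embedded.shape)
```

The one-hot width grows with the vocabulary, while the embedding width stays fixed at the chosen dimension.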

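Case study:

Download the IMDB movie reviews, each with its [text] and [label (worth watching (1) or not worth watching (0))], and build a model with Keras.

Method:

Extract the downloaded archive. It contains train and test folders, each with pos and neg subfolders that hold the positive and negative review text files.
Open the text files one by one and read them into a list
Convert the list of texts into lists of integers with keras.preprocessing.text.Tokenizer
Pad lists shorter than 100 elements to length 100
Build the Keras model
Visualize the training process
Evaluate the model's accuracy
Save the model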
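The padding step in the list above can be illustrated without Keras. The sketch below mimics the default behavior of `sequence.pad_sequences` (pad on the left, and for over-long sequences keep the last `maxlen` items); the function name `pad_to_length` is hypothetical.

```python
import numpy as np

def pad_to_length(sequences, maxlen=100, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_to_length([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(padded)
```

Every review therefore becomes a row of the same length, which is what lets the Embedding layer receive a rectangular integer tensor.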

import urllib.request
import os
import tarfile
import numpy as np
url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filepath='data/aclImdb_v1.tar.gz'
# Download the archive
os.makedirs('data',exist_ok=True)
if not os.path.isfile(filepath):
    result=urllib.request.urlretrieve(url,filepath)
    print('download',result)
# Extract the archive
if not os.path.exists('data/aclImdb'):
    tfile=tarfile.open(filepath,'r:gz')
    result=tfile.extractall('data/')
from keras.preprocessing import sequence  # pads sequences to a fixed length
from keras.preprocessing.text import Tokenizer  # builds the word dictionary
import re  # regular expressions for text cleanup
def rm_tags(text):
    # Strip HTML tags from a review
    re_tag=re.compile(r'<[^>]+>')
    return re_tag.sub('',text)
# Read every text file in a folder into a list
def read_files(filetype):
    path='data/aclImdb/'
    file_list=[]

    positive_path=path+filetype+'/pos/'
    for f in os.listdir(positive_path):
        file_list+=[positive_path+f]

    negative_path=path+filetype+'/neg/'
    for f in os.listdir(negative_path):
        file_list+=[negative_path+f]

    print('read',filetype,'files:',len(file_list))

    all_labels=[1]*12500+[0]*12500  # 12,500 positive reviews followed by 12,500 negative ones

    all_texts=[]
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            all_texts += [rm_tags(''.join(file_input.readlines()))]

    return all_labels,all_texts
y_train,train_text=read_files('train')
y_test,test_text=read_files('test')
y_train,y_test=np.array(y_train),np.array(y_test)
# Build the Tokenizer dictionary from the 2,000 most frequent words
token=Tokenizer(num_words=2000)
token.fit_on_texts(train_text)
print(token.word_index)
# Convert the reviews to integer sequences
x_train_seq=token.texts_to_sequences(train_text)
x_test_seq=token.texts_to_sequences(test_text)
# Make every sequence length 100: shorter ones are padded with 0, longer ones are truncated
x_train=sequence.pad_sequences(x_train_seq,maxlen=100)
x_test=sequence.pad_sequences(x_test_seq,maxlen=100)
# Build the model
# The Embedding layer converts each list of integers into a list of vectors
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten,SimpleRNN
from keras.layers import Embedding
model=Sequential()
model.add(Embedding(output_dim=32,input_dim=2000,input_length=100))
# output_dim: embedding size; input_dim: vocabulary size; input_length: length of each input sequence
model.add(Dropout(0.2))
# The embedding output is a 3D tensor; Flatten converts it to 2D (axis 0 is kept, the other axes are multiplied together)
model.add(Flatten())
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=1,activation='sigmoid'))
model.summary()


model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
# Note: validation_split takes the last 10% of the samples before shuffling; with the data
# ordered pos then neg, the validation set here contains only negative reviews
train_history=model.fit(x_train,y_train, validation_split=0.1, epochs=30,batch_size=30,verbose=2)
# Visualize the training history
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train','validation'],loc='upper left')
    plt.show()
show_train_history(train_history,'acc','val_acc')
# If training accuracy keeps rising while validation accuracy does not, the model is probably overfitting

# Save the Keras model
model.save('my_model.h5')
# Delete the in-memory model, then reload it from disk
del model
from keras.models import load_model
model = load_model('my_model.h5')
model.summary()


# Evaluate the Keras model on the test set
y_test=np.array(y_test)
scores=model.evaluate(x_test,y_test,verbose=1)
# Test accuracy
scores[1]
# Add an RNN layer inside the model
# Build the model
# The Embedding layer converts each list of integers into a list of vectors
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, SimpleRNN
from keras.layers import Embedding
model=Sequential()
model.add(Embedding(output_dim=32,input_dim=2000,input_length=100))
model.add(Dropout(0.2))
model.add(SimpleRNN(units=16))  # replaces Flatten: the RNN reduces the sequence to one vector
#model.add(Flatten())
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
train_history=model.fit(x_train,y_train, validation_split=0.1, epochs=30,batch_size=30,verbose=2)
show_train_history(train_history,'acc','val_acc')
# If training accuracy keeps rising while validation accuracy does not, the model is probably overfitting


# Save the Keras model
model.save('my_model_RNN.h5')
scores=model.evaluate(x_test,y_test,verbose=1)
scores[1]

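[Stocks] Odds of Profit or Loss When Buying on One Day's Trust-to-Capital and Foreign-to-Capital Ratios

Using the data for 2019/01/03 as the baseline, the program written earlier computes that day's top 10 stocks bought and sold, ranked by the trust-to-capital and foreign-to-capital ratios.

Foreign-to-capital ratio = foreign investors' net buy/sell volume for the day / the stock's share capital
Trust-to-capital ratio = investment trusts' net buy/sell volume for the day / the stock's share capital

If we pick stocks from the institutional flows on 2019/01/03 and compute the gain after holding for several weeks, it is hard to see that the stocks bought most heavily by trusts and foreign investors that day later rose more than the stocks they sold.

Suppose that on 2019/1/3 we buy 10 stocks chosen by these ratios, NT$100,000 each for NT$1,000,000 in total, and then track the average value of these 10 stocks over time.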
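The value of such an equal-weighted basket can be sketched as follows. The prices are made-up placeholders, not the actual 2019/01/03 data, and the function name `equal_weight_value` is hypothetical; fractional shares are allowed to keep the arithmetic simple.

```python
def equal_weight_value(buy_prices, later_prices, budget=1_000_000):
    """Value of a basket bought with equal cash per stock, marked at later prices."""
    per_stock = budget / len(buy_prices)
    shares = [per_stock / p for p in buy_prices]  # fractional shares, for simplicity
    return sum(s * p for s, p in zip(shares, later_prices))

# Placeholder prices for illustration only (not real quotes)
buy = [10.0, 20.0]
later = [11.0, 18.0]
value = equal_weight_value(buy, later, budget=1000)
print(value)  # 50 shares * 11 + 25 shares * 18
```

With equal cash per stock, the basket's return is simply the average of the individual stock returns.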

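Buying the top 10 net buys by foreign-to-capital ratio produced the larger return, whereas buying the top 10 net buys by trust-to-capital ratio produced a worse return.

Conclusion: buying or selling for the short to medium term based on a single day's chip concentration is not very reliable. Later I will write an evaluation of how chip concentration measured over "a period of time" relates to short- and medium-term trading.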

Saturday, June 8, 2019

[Image Recognition] Image Classification When the Sample Is Too Small

import os, shutil
original_dataset_dir=r'dogs-vs-cats/train'
base_dir=r'cats_and_dogs_small'
if not os.path.isdir(base_dir): os.mkdir(base_dir)
train_dir=os.path.join(base_dir,'train_dir')
if not os.path.isdir(train_dir):os.mkdir(train_dir)  # training folder

validation_dir=os.path.join(base_dir,'validation_dir')
if not os.path.isdir(validation_dir):os.mkdir(validation_dir)  # validation folder

test_dir=os.path.join(base_dir,'test_dir')
if not os.path.isdir(test_dir):os.mkdir(test_dir)  # test folder

train_cat_dir=os.path.join(train_dir,'train_cat_dir')
if not os.path.isdir(train_cat_dir):os.mkdir(train_cat_dir)  # cat training folder

train_dog_dir=os.path.join(train_dir,'train_dog_dir')
if not os.path.isdir(train_dog_dir):os.mkdir(train_dog_dir)  # dog training folder

validation_cat_dir=os.path.join(validation_dir,'validation_cat_dir')
if not os.path.isdir(validation_cat_dir):os.mkdir(validation_cat_dir)  # cat validation folder

validation_dog_dir=os.path.join(validation_dir,'validation_dog_dir')
if not os.path.isdir(validation_dog_dir):os.mkdir(validation_dog_dir)  # dog validation folder

test_cat_dir=os.path.join(test_dir,'test_cat_dir')
if not os.path.isdir(test_cat_dir):os.mkdir(test_cat_dir)  # cat test folder

test_dog_dir=os.path.join(test_dir,'test_dog_dir')
if not os.path.isdir(test_dog_dir):os.mkdir(test_dog_dir)  # dog test folder

# Copy the first 1,000 cat images to train_cat_dir
fnames=['cat.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(train_cat_dir,fname)
    shutil.copyfile(src,dst)

# Copy the next 500 cat images to validation_cat_dir
fnames=['cat.{}.jpg'.format(i) for i in range(1000,1500)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(validation_cat_dir,fname)
    shutil.copyfile(src,dst)

# Copy the next 500 cat images to test_cat_dir
fnames=['cat.{}.jpg'.format(i) for i in range(1500,2000)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(test_cat_dir,fname)
    shutil.copyfile(src,dst)

# Copy the first 1,000 dog images to train_dog_dir
fnames=['dog.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(train_dog_dir,fname)
    shutil.copyfile(src,dst)

# Copy the next 500 dog images to validation_dog_dir
fnames=['dog.{}.jpg'.format(i) for i in range(1000,1500)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(validation_dog_dir,fname)
    shutil.copyfile(src,dst)

# Copy the next 500 dog images to test_dog_dir
fnames=['dog.{}.jpg'.format(i) for i in range(1500,2000)]
for fname in fnames:
    src=os.path.join(original_dataset_dir,fname)
    dst=os.path.join(test_dog_dir,fname)
    shutil.copyfile(src,dst)

print('training cat images:',len(os.listdir(train_cat_dir)))
print('validation cat images:',len(os.listdir(validation_cat_dir)))
print('test cat images:',len(os.listdir(test_cat_dir)))
print('training dog images:',len(os.listdir(train_dog_dir)))
print('validation dog images:',len(os.listdir(validation_dog_dir)))
print('test dog images:',len(os.listdir(test_dog_dir)))

# Build the neural network
from keras import layers
from keras import models
model=models.Sequential()
model.add(layers.Conv2D(32,(3,3),activation='relu',input_shape=(150,150,3)))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(64,(3,3),activation='relu'))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(128,(3,3),activation='relu'))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Flatten())
model.add(layers.Dense(512,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))

model.summary()

from keras import optimizers
model.compile(loss='binary_crossentropy',optimizer=optimizers.RMSprop(lr=1e-4),metrics=['acc'])

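# Data preprocessing
# 1. Read the files 2. Decode the JPEG content into RGB pixels 3. Convert the RGB pixels into floating-point tensors 4. Rescale pixel values from (0-255) to the (0-1) range
# ImageDataGenerator from keras.preprocessing.image sets up Python generators that turn image files into batches of tensors automatically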
from keras.preprocessing.image import ImageDataGenerator
train_datagen=ImageDataGenerator(rescale=1./255)
test_datagen=ImageDataGenerator(rescale=1./255)
train_generator=train_datagen.flow_from_directory(train_dir,target_size=(150,150),batch_size=20,class_mode='binary')
validation_generator=test_datagen.flow_from_directory(validation_dir,target_size=(150,150),batch_size=20,class_mode='binary')

for data_batch,labels_batch in validation_generator:
    print('data batch shape:',data_batch.shape)
    print('labels batch shape:',labels_batch.shape)
    break

history=model.fit_generator(train_generator,
                            steps_per_epoch=10,
                            epochs=50, 
                            validation_data=validation_generator,
                            validation_steps=50)

model.save('cats_and_dogs_small_1.h5')

import matplotlib.pyplot as plt
acc=history.history['acc']
val_acc=history.history['val_acc']
loss=history.history['loss']
val_loss=history.history['val_loss']
epochs=range(1,len(acc)+1)
plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()
plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

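Because there are few training images, the model overfits easily. Remedies for overfitting:
1. Use data augmentation to enlarge the training sample space
2. Use regularization (not configured in this example)
3. Use dropout, which randomly switches off part of the network during training
Below we use data augmentation.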

# ImageDataGenerator from keras.preprocessing.image can augment the data
datagen=ImageDataGenerator(rotation_range=40,
                           width_shift_range=0.2,
                           height_shift_range=0.2,
                           shear_range=0.2,
                           zoom_range=0.2,
                           horizontal_flip=True,
                           fill_mode='nearest')

from keras.preprocessing import image
fnames=[os.path.join(train_cat_dir,fname) for fname in os.listdir(train_cat_dir)]
# Load the image at index 3 with image.load_img()
img=image.load_img(fnames[3],target_size=(150,150))
img

# Convert the image to an array
x=image.img_to_array(img)  # array of shape (150,150,3)
x=x.reshape((1,)+x.shape)  # reshape to (1,150,150,3), a batch of one image
i=0
# datagen.flow yields augmented images indefinitely, so stop after three
for batch in datagen.flow(x,batch_size=1):
    plt.figure(i)
    imgplot=plt.imshow(image.array_to_img(batch[0]))
    i+=1
    if i%3==0:
        break
plt.show()

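We have now seen how ImageDataGenerator from keras.preprocessing.image can augment a limited sample. Below we combine two remedies to address overfitting: data augmentation, which enlarges the training sample space, and dropout, which randomly switches off part of the network during training.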

model=models.Sequential()
model.add(layers.Conv2D(32,(3,3),activation='relu',input_shape=(150,150,3)))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(64,(3,3),activation='relu'))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(128,(3,3),activation='relu'))
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))  # dropout layer: randomly zeroes 50% of the activations during training
model.add(layers.Dense(512,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))
from keras import optimizers
model.compile(loss='binary_crossentropy',optimizer=optimizers.RMSprop(lr=1e-4),metrics=['acc'])

# Configure the data augmentation (applied to the training data only)
train_datagen=ImageDataGenerator(rescale=1./255,
                                 rotation_range=40,
                                 width_shift_range=0.2,
                                 height_shift_range=0.2,
                                 shear_range=0.2,
                                 zoom_range=0.2,
                                 horizontal_flip=True
                                )
test_datagen=ImageDataGenerator(rescale=1./255)

# Generate augmented training batches with the settings above
train_generator=train_datagen.flow_from_directory(train_dir,
                                                  target_size=(150,150),
                                                  batch_size=32,
                                                  class_mode='binary')
validation_generator=test_datagen.flow_from_directory(validation_dir,
                                                target_size=(150,150),
                                                batch_size=32,
                                                class_mode='binary')
history=model.fit_generator(train_generator,
                            steps_per_epoch=10,
                            epochs=50, 
                            validation_data=validation_generator,
                            validation_steps=50)
model.save('cats_and_dogs_small_2.h5')

import matplotlib.pyplot as plt
acc=history.history['acc']
val_acc=history.history['val_acc']
loss=history.history['loss']
val_loss=history.history['val_loss']
epochs=range(1,len(acc)+1)
plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()
plt.plot(epochs,loss,'bo',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()


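Friday, June 7, 2019

[Stocks] Using Python to Find the Daily Most Accumulated and Most Distributed Taiwan Stocks

Because share capital differs from stock to stock, we normalize the net volumes bought and sold by foreign investors and investment trusts by share capital, which makes the resulting chip-concentration measure more meaningful.

Foreign-to-capital ratio = foreign investors' net buy/sell volume for the day / the stock's share capital
Trust-to-capital ratio = investment trusts' net buy/sell volume for the day / the stock's share capital

Data needed: 1. the day's net buy/sell volume of foreign investors (and investment trusts), 2. the day's closing price, 3. each stock's share capital

1. Foreign investors' buy/sell volume: http://www.twse.com.tw/fund/TWT38U?response=html&date=20190606
Investment trusts' buy/sell volume: http://www.twse.com.tw/fund/TWT44U?response=html&date=20190606

2. Daily price data: http://www.twse.com.tw/exchangeReport/MI_INDEX?response=csv&date=20190606&type=ALL

3. Share-capital data download: https://www.dropbox.com/s/o0z0p00ap4y6eon/stock_capital.csv?dl=0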
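Before downloading anything, the ratio itself can be checked on a toy table. The sketch below follows the formula that this post's own code uses for the trust-to-capital ratio (net shares times closing price, divided by share capital, as a percentage); the column names and numbers are made up for illustration, and `capital_ratio` is a hypothetical helper.

```python
import pandas as pd

def capital_ratio(net_shares, close, capital_100m):
    """Net-buy value as a percentage of share capital (capital in units of 100 million NTD)."""
    return net_shares * close / (capital_100m * 1e8) * 100

# Toy data for illustration only (not real quotes)
toy = pd.DataFrame({
    "code": ["A", "B", "C"],
    "net_shares": [2_000_000, -500_000, 100_000],
    "close": [50.0, 20.0, 200.0],
    "capital_100m": [10.0, 5.0, 40.0],
})
toy["ratio_pct"] = capital_ratio(toy["net_shares"], toy["close"], toy["capital_100m"])

# Largest net buying relative to capital first, largest net selling last
print(toy.sort_values("ratio_pct", ascending=False))
```

Dividing by share capital is what makes a small company with heavy buying comparable to a large one.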

import os
import requests as rq
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime
from io import StringIO
import time

# Load the share-capital data
stock_capital=pd.read_csv('stock_capital.csv',delimiter=';')
del stock_capital[stock_capital.columns[1]]
stock_capital.columns=['證券代號','股本(億)']  # security code, share capital (in 100 million NTD)
stock_capital.iloc[:,0]=stock_capital.iloc[:,0].astype('str')

# Download the daily closing data
datestr=time.strftime("%Y%m%d", time.localtime())  # today's date
datestr='20190606'  # overridden with the fixed date used in this post
r = rq.post('http://www.twse.com.tw/exchangeReport/MI_INDEX?response=csv&date=' + datestr + '&type=ALL')
if len(r.text)>0:
    # Keep only the 17-field rows of the per-stock table and strip spaces
    df_stock = pd.read_csv(StringIO("\n".join([line.translate({ord(c): None for c in ' '})
                                for line in r.text.split('\n')
                                    if len(line.split('",')) == 17 and line[0] != '='])), header=0)
    os.makedirs('stock',exist_ok=True)
    df_stock.to_csv('stock/'+datestr)
    time.sleep( 5 )
# Drop rows without a closing price
df_stock=df_stock.drop(df_stock[df_stock['收盤價']=='--'].index)
# Convert the closing price (收盤價) and opening price (開盤價) to floats
df_stock['收盤價']=[float(i.replace(",","")) for i in df_stock['收盤價'].values]
df_stock['開盤價']=[float(i.replace(",","")) for i in df_stock['開盤價'].values]

# Foreign investors' net buy/sell
url_w="http://www.twse.com.tw/fund/TWT38U?response=html&date=20190606"
df_w=pd.read_html(url_w)
df_w=df_w[0].iloc[:,1:6]
columns=[]
for (a,b,c)in list(df_w.columns):  # flatten the three-level header, keeping the lowest level
    columns.append(c)
df_w.columns=columns
del df_w['證券名稱']
df_w.iloc[:,0]=df_w.iloc[:,0].astype('str')

# Join the three tables (df_stock, stock_capital, df_w)
df_w_all_stock = pd.merge(df_stock,stock_capital,on='證券代號', how='inner')
df_w_all_stock = pd.merge(df_w_all_stock,df_w,on='證券代號', how='inner')

# Compute the foreign-to-capital ratio with the same formula as the trust-to-capital ratio
df_w_all_stock['外本比(%)']=(df_w_all_stock['買賣超股數'].astype(float)*df_w_all_stock['收盤價'].astype(float)/(df_w_all_stock['股本(億)']*100000000).astype(float))*100

# Top 10 by foreign-to-capital ratio
index=df_w_all_stock['外本比(%)'].sort_values(ascending=False)[:10].index
df_w_all_stock_top=df_w_all_stock.iloc[index]
df_w_all_stock_top=df_w_all_stock_top.loc[:,['證券代號','證券名稱','外本比(%)','本益比','漲跌(+/-)','開盤價','收盤價']]
print('Top 10 net buys by foreign-to-capital ratio:')
df_w_all_stock_top

# Bottom 10 by foreign-to-capital ratio
index=df_w_all_stock['外本比(%)'].sort_values(ascending=False)[-10:].index
df_w_all_stock_bottom=df_w_all_stock.iloc[index]
df_w_all_stock_bottom=df_w_all_stock_bottom.loc[:,['證券代號','證券名稱','外本比(%)','本益比','漲跌(+/-)','開盤價','收盤價']]
print('Top 10 net sells by foreign-to-capital ratio:')
df_w_all_stock_bottom

# Trust-to-capital ratio
# Investment trusts' net buy/sell
url_t="http://www.twse.com.tw/fund/TWT44U?response=html&date=20190606"
df_t=pd.read_html(url_t)
df_t=df_t[0].iloc[:,1:]
columns=[]
for (a,b)in list(df_t.columns):  # flatten the two-level header, keeping the lower level
    columns.append(b)
df_t.columns=columns
del df_t['證券名稱']
df_t.iloc[:,0]=df_t.iloc[:,0].astype('str')

# Join the three tables (df_stock, stock_capital, df_t)
df_t_all_stock = pd.merge(df_stock,stock_capital,on='證券代號', how='inner')
df_t_all_stock = pd.merge(df_t_all_stock,df_t,on='證券代號', how='inner')

# Compute the trust-to-capital ratio: net-buy value as a percentage of share capital
df_t_all_stock['投本比(%)']=(df_t_all_stock['買賣超股數'].astype(float)*df_t_all_stock['收盤價'].astype(float)/(df_t_all_stock['股本(億)']*100000000).astype(float))*100

# Top 10 by trust-to-capital ratio
index=df_t_all_stock['投本比(%)'].sort_values(ascending=False)[:10].index
df_t_all_stock_top=df_t_all_stock.iloc[index]
df_t_all_stock_top=df_t_all_stock_top.loc[:,['證券代號','證券名稱','投本比(%)','本益比','漲跌(+/-)','開盤價','收盤價']]
print('Top 10 net buys by trust-to-capital ratio:')
df_t_all_stock_top

# Bottom 10 by trust-to-capital ratio
index=df_t_all_stock['投本比(%)'].sort_values(ascending=True)[0:10].index
df_t_all_stock_bottom=df_t_all_stock.iloc[index]
df_t_all_stock_bottom=df_t_all_stock_bottom.loc[:,['證券代號','證券名稱','投本比(%)','本益比','漲跌(+/-)','開盤價','收盤價']]
print('Top 10 net sells by trust-to-capital ratio:')
df_t_all_stock_bottom

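[Stocks] Fetching Taiwan Stock Market Capitalization Data Published by the FSC

The Financial Supervisory Commission (FSC) publishes market capitalization, price-to-earnings ratio, and turnover data once a month.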

import requests as rq
import pandas as pd
import csv
import matplotlib.pyplot as plt
import numpy as np
url="http://research.fsc.gov.tw/fsd/fncl_od.asp?opendata=FSF024"
r=rq.get(url).content.decode('utf-8')
data=list(csv.reader(r.split('\n'),delimiter=','))
df=pd.DataFrame(data[1:len(data)-1],columns=data[0])
#pd.DataFrame(data)
# Filter out the leading rows that contain only a month
df2=df[df.iloc[:,0].astype(int)>105]
df2.astype(float)
df2.index-=18
df2.head()

columns=len(df2.columns)
for i in np.arange(1,columns):
    plt.figure(i)
    plt.title(df2.columns[i])
    plt.plot(df2.iloc[:,i].astype(float))
    plt.rcParams['font.sans-serif']=['SimHei']  # font setting so that Chinese labels render correctly
    plt.xticks(np.arange(0,df2.shape[0],30),df2.iloc[0::30,0])
#    if i%3==0:
#        break
plt.show()

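Over the past ten years, Taiwan's stock market capitalization as a share of GDP has mostly stayed between 150% and 180%, and the periods below 150% have been quite short. The ratio can therefore be used to judge whether Taiwan stocks are overvalued or undervalued: near 180% they are expensive, near 150% they are cheap, and when they are cheap one can buy 0050 or 0056.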
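The rule of thumb above can be written as a small function. This is only a sketch: the post says "near" 150% and "near" 180% without defining how near, so the `tolerance` parameter (5 percentage points) is an assumption introduced here, and the function name is hypothetical.

```python
def valuation_signal(ratio_pct, low=150.0, high=180.0, tolerance=5.0):
    """Classify the market-cap-to-GDP ratio (in percent) with the 150/180 rule of thumb."""
    if ratio_pct <= low + tolerance:
        return "cheap"
    if ratio_pct >= high - tolerance:
        return "expensive"
    return "neutral"

for ratio in (148.0, 165.0, 178.0):
    print(ratio, valuation_signal(ratio))
```

The thresholds come from the ten-year range described in this post and would need revisiting if that range shifts.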

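Thursday, June 6, 2019

[Regression Analysis] K-fold Validation Example: Predicting Boston House Prices

This technique is used when the sample is relatively small. We split the training data into k groups, hold one group out for validation, and rotate through all k groups. For example, with k=4 the training data is split into four groups, and each time one group is withheld from training and used only as the validation sample. Leave-one-out (LOO) is the special case in which k equals the number of samples.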
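The rotation described above can be seen by looking only at the indices, with no model involved. The sketch below is a minimal illustration; `k_fold_indices` is a hypothetical helper that splits contiguous blocks the same way the slicing in this post does, and any remainder samples (when the size is not divisible by k) always stay in the training part.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, holding out one contiguous block per fold."""
    fold_size = n_samples // k
    indices = np.arange(n_samples)
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_idx = indices[start:stop]
        train_idx = np.concatenate([indices[:start], indices[stop:]])
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(404, 4):
    print(len(train_idx), len(val_idx), val_idx[0], val_idx[-1])
```

Each sample is used for validation exactly once, and the final score is the average over the k folds.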

import numpy as np
from keras.datasets import boston_housing
from keras import models
from keras import layers

# Load the Boston housing data
(train_data,train_labels),(test_data,test_labels)=boston_housing.load_data()
# Standardize the features with the training-set mean and standard deviation
mean=train_data.mean(axis=0)
train_data-=mean

std=train_data.std(axis=0)

train_data/=std
test_data-=mean
test_data/=std

test_data.shape
# (102, 13): 102 test samples, each with 13 features

train_data.shape
# (404, 13): 404 training samples, each with 13 features


# Build the model
def build_model():
    model=models.Sequential()
    model.add(layers.Dense(64,activation='relu',input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64,activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop',loss='mse',metrics=['mae'])  #mse: mean square error; mae: mean absolute error
    return model

# Split the training data into k groups and rotate through them with a for loop
k=4
num_val_samples=len(train_data)//k
num_epochs=100
all_scores=[]
for i in range(k):
    print('processing for #',i)
    val_data=train_data[i*num_val_samples:(i+1)*num_val_samples]
    val_targets=train_labels[i*num_val_samples:(i+1)*num_val_samples]
    partial_train_data=np.concatenate([train_data[:i*num_val_samples],train_data[(i+1)*num_val_samples:]],axis=0)
    partial_train_labels=np.concatenate([train_labels[:i*num_val_samples],train_labels[(i+1)*num_val_samples:]],axis=0)  
    model=build_model()
    history=model.fit(partial_train_data,partial_train_labels, validation_data=(val_data,val_targets),epochs=num_epochs, batch_size=1,verbose=0)
#    val_mse,val_mae=model.evaluate(val_data,val_targets,verbose=1)
    mae_history=history.history['val_mean_absolute_error']
    all_scores.append(mae_history)
    

# Visualize the validation MAE of each fold
import matplotlib.pyplot as plt
for i in range(k):
    plt.plot(range(1,len(all_scores[i])+1),all_scores[i])
plt.legend([1,2,3,4])
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

Monday, June 3, 2019

[Web Scraping] Extracting Data from a Web Page with Python

Example: fetch the New Taipei City tourism-factory data

import requests as rq
from xml.etree import ElementTree
import pandas as pd

url='https://data.ntpc.gov.tw/od/data/api/57EB9B00-979C-44BB-A4EE-CC55BDF1488A?$format=xml'
r=rq.request('GET',url)

tree=ElementTree.fromstring(r.content)
list_data=[]
for i in tree.iter('row'):
    single_record=[]
    for j in i.iter():
        if j.tag=='title' or j.tag=='features' or j.tag=='tel' or j.tag=='address':
            single_record.append(j.text)
    list_data.append(single_record)

df=pd.DataFrame(list_data,columns=['Name','Features','Phone','Address'])
df.index+=1
print(df)
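The parsing step can be tried offline on a small inline document. The XML below is made up and only mirrors the tag names used above (`row`, `title`, `features`, `tel`, `address`); the real feed may contain other tags. This variant looks each field up by name, which is a design choice of this sketch rather than the approach used in the post.

```python
from xml.etree import ElementTree

# Made-up sample that mirrors the tag names used above (assumption)
SAMPLE_XML = """<data>
  <row><title>Factory A</title><features>Soap making</features><tel>02-0000-0001</tel><address>Road 1</address></row>
  <row><title>Factory B</title><tel>02-0000-0002</tel><address>Road 2</address></row>
</data>"""

WANTED = ("title", "features", "tel", "address")

def parse_rows(xml_text):
    """Return one dict per <row>, with None for any missing tag."""
    tree = ElementTree.fromstring(xml_text)
    records = []
    for row in tree.iter("row"):
        # findtext returns None when a tag is absent, so the columns stay aligned
        records.append({tag: row.findtext(tag) for tag in WANTED})
    return records

records = parse_rows(SAMPLE_XML)
for record in records:
    print(record)
```

Looking fields up by tag name keeps every record the same width even when a row lacks a tag, whereas appending values in iteration order would shift the later columns.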

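[Stocks] Fetching Daily Taiwan Stock Quotes with Python

In this post we use a web crawler to save daily stock prices from the Taiwan Stock Exchange website to the local machine. First we import the required Python modules.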

import datetime
import requests
from io import StringIO
import pandas as pd
import numpy as np
import time

Set the crawler's start and end dates, in '%d-%m-%Y' format

start = datetime.datetime.strptime("01-01-2018", "%d-%m-%Y")
end = datetime.datetime.strptime("03-06-2019", "%d-%m-%Y")
date_generated = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)]

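Use a for loop to fetch each day's stock data and save it to disk, pausing five seconds before the next request to avoid being blocked by the remote server

import os
os.makedirs('stock',exist_ok=True)  # make sure the output folder exists
for i in date_generated:
    print(i.strftime("%Y%m%d"))
    datestr=i.strftime("%Y%m%d")
    r = requests.post('http://www.twse.com.tw/exchangeReport/MI_INDEX?response=csv&date=' + datestr + '&type=ALL')
    if len(r.text)>0:
        # Keep only the 17-field rows of the per-stock table and strip spaces
        df = pd.read_csv(StringIO("\n".join([line.translate({ord(c): None for c in ' '})
                                    for line in r.text.split('\n')
                                        if len(line.split('",')) == 17 and line[0] != '='])), header=0)
        df.to_csv('stock/'+datestr)
    time.sleep( 5 )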
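The date list above includes weekends and stops one day before the end date. A possible refinement, sketched here as an assumption rather than as part of the original crawler, is to request only weekdays and to include the end date; the helper name `trading_day_candidates` is hypothetical, and public holidays are still not excluded.

```python
import datetime

def trading_day_candidates(start, end):
    """Weekdays from start to end inclusive (public holidays are not excluded)."""
    n_days = (end - start).days + 1
    all_days = (start + datetime.timedelta(days=x) for x in range(n_days))
    return [d for d in all_days if d.weekday() < 5]  # Monday=0 ... Friday=4

dates = trading_day_candidates(datetime.date(2019, 5, 27), datetime.date(2019, 6, 3))
print([d.strftime("%Y%m%d") for d in dates])
```

Skipping weekends removes roughly two of every seven requests, which also shortens the total time spent in the five-second pauses.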
