量化投资学习笔记40——模型融合

集成把不同模型的预测结果结合起来,生成最终预测,集成的模型越多,效果就越好。另外,由于集成结合了不同的基线预测,它们的性能至少等同于最优的基线模型。集成使得我们几乎免费就获得了性能提升!
集成的基本概念:结合多个模型的预测,对特异性误差取平均,从而获得更好的整体预测结果。

stacking:stacking是一种分层模型集成框架。以两层为例,第一层由多个基学习器组成,其输入为原始训练集,第二层的模型则是以第一层基学习器的输出作为特征加入训练集进行再训练,从而得到完整的stacking模型。
不同的模型,是在不同的角度观察我们的数据集。
基本步骤:
①选择基模型。各种基本的机器学习算法。
②把训练集分成不交叉的若干份。
③将其中一份作为预测集,使用其它份进行建模,预测预测集,保留结果。
④把预测结果按照对应的位置填上,得到对整个数据集在第一个基模型上的一个stacking转换。
⑤在④的过程中,每个模型分别对测试集进行预测,并保留这五列结果,取平均值,作为该基模型对测试集数据的一个stacking转换。
⑥对其它基模型重复②-⑤步。
⑦一般使用LR作为第二层的模型进行建模预测。

整个过程比较耗时,尽量增加一些不同类型的模型。
下面实操一下吧,参考了[8]
问题是假设四个人扔187个飞镖。其中150个飞镖观察是谁扔的,扔到了哪里(训练数据)。剩下的作为测试数据,知道飞镖的位置。我们的学习任务是根据飞镖的位置,猜测是谁扔的。
先读取数据。

1
2
3
4
5
6
7
8
9
10
import numpy as np
import pandas as pd


if __name__ == "__main__":
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
print(train_data.head())
print(train_data.info())
print(test_data.info())


这是有监督学习的分类问题,把能用的算法都用上看看吧。
先画图看看

1
2
3
4
5
6
7
8
9
10
11
# 画图看看
fig = plt.figure()
colors = ['b','g','r','orange']
labels = ['Sue', 'Kate', 'Mark', 'Bob']
for index in range(4):
x = train_data.loc[train_data["Competitor"] == labels[index]]["XCoord"]
y = train_data.loc[train_data["Competitor"] == labels[index]]["YCoord"]
plt.scatter(x, y, c = colors[index], label = labels[index])
plt.legend(loc = "best")
plt.savefig("data.png")
plt.close()


可以看到四个人投掷的坐标还是有明显差别的。
下面就一个算法一个算法先试吧。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
from sklearn.linear_model import LogisticRegression #逻辑回归
from sklearn.svm import SVC, LinearSVC #支持向量机
from sklearn.ensemble import RandomForestClassifier #随机森林
from sklearn.neighbors import KNeighborsClassifier #K最邻近算法
from sklearn.naive_bayes import GaussianNB #朴素贝叶斯
from sklearn.linear_model import Perceptron #感知机算法
from sklearn.linear_model import SGDClassifier #梯度下降分类
from sklearn.tree import DecisionTreeClassifier #决策树算法
from sklearn.model_selection import StratifiedKFold #K折交叉切分
from sklearn.model_selection import GridSearchCV #网格搜索

from sklearn.model_selection import cross_val_score


# 对模型进行交叉验证
def cross_val(model, X, Y, cv=5):
scores = cross_val_score(model, X, Y, cv=cv)
score = scores.mean()
return score


# 模型评分
def ModelTest(Model, X_train, Y_train):
Model.fit(X_train, Y_train)
# 对模型评分
acc_result = round(Model.score(X_train, Y_train)*100, 2)
return a


# 基本模型比较
# 划分数据
X_train = train_data.drop(["Competitor", "ID"], axis = 1)
print(X_train.info())
Y_train = train_data["Competitor"]
X_test = test_data.drop(["Competitor", "ID"], axis = 1)
Y_test = test_data["Competitor"]
# 逻辑回归模型
LogModel = LogisticRegression()
acc_log = ModelTest(LogModel, X_train, Y_train)
print("逻辑回归结果:{}".format(acc_log))
cross_log = cross_val(LogModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_log))

# SVM支持向量机模型
SVMModel = SVC()
acc_svc = ModelTest(SVMModel, X_train, Y_train)
print("支持向量机结果:{}".format(acc_svc))
cross_svc = cross_val(SVMModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_svc))

# knn算法
knnModel = KNeighborsClassifier(n_neighbors = 4)
acc_knn = ModelTest(knnModel, X_train, Y_train)
print("knn结果:{}".format(acc_knn))
cross_knn = cross_val(knnModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_knn))

# 朴素贝叶斯模型
BYSModel = GaussianNB()
acc_bys = ModelTest(BYSModel, X_train, Y_train)
print("朴素贝叶斯算法结果:{}".format(acc_bys))
cross_bys = cross_val(BYSModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_bys))

# 感知机算法
percModel = Perceptron()
acc_perc = ModelTest(percModel, X_train, Y_train)
print("感知机算法算法结果:{}".format(acc_perc))
cross_perc = cross_val(percModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_perc))

# 线性分类支持向量机
lin_svcModel = LinearSVC()
acc_lin_svc = ModelTest(lin_svcModel, X_train, Y_train)
print("线性分类支持向量机算法结果:{}".format(acc_lin_svc))
cross_lin_svc = cross_val(lin_svcModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_lin_svc))

# 梯度下降分类算法
sgdModel = SGDClassifier()
acc_sgd = ModelTest(sgdModel, X_train, Y_train)
print("梯度下降分类算法结果:{}".format(acc_sgd))
cross_sgd = cross_val(sgdModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_sgd))

# 决策树算法
treeModel = DecisionTreeClassifier()
acc_tree = ModelTest(treeModel, X_train, Y_train)
print("决策树算法结果:{}".format(acc_tree))
cross_tree = cross_val(treeModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_tree))

# 随机森林算法
forestModel = RandomForestClassifier()
acc_rand = ModelTest(forestModel, X_train, Y_train)
print("随机森林算法结果:{}".format(acc_rand))
cross_rand = cross_val(forestModel, X_train, Y_train, cv = 5)
print("交叉验证得分:%.3f" % (cross_rand))

# 模型评分
print("模型评分")
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic Gradient Decent', 'Linear SVC','Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log, acc_rand, acc_bys, acc_perc, acc_sgd, acc_lin_svc, acc_tree]})
print(models.sort_values(by='Score', ascending=False))

# 模型交叉验证评分
print("模型交叉验证评分")
models_cross_val = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic Gradient Decent', 'Linear SVC','Decision Tree'],
'Score': [cross_svc, cross_knn, cross_log, cross_rand, cross_bys, cross_perc, cross_sgd, cross_lin_svc, cross_tree]})
print(models_cross_val.sort_values(by='Score', ascending=False))

结果如下:

用评分前五名的决策树,随机森林,KNN,朴素贝叶斯,支持向量机算法进行stacking。先计算各个模型对测试集预测的正确率。先写一个测试函数。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 用测试集检验模型预测的正确率
def testModel(Model, X_train, Y_train, X_test, Y_test):
Model.fit(X_train, Y_train)
# 用模型对测试集数据进行预测
res = Model.predict(X_test)
print(res, Y_test.values)
n = 0.0
Y_test_value = Y_test.values
for i in range(len(Y_test_value)):
# print(i, res[i], Y_test.iloc[i:1])
if res[i] == Y_test_value[i]:
n += 1.0
score = n/len(Y_test)
return score

然后用五个模型分别预测。

1
2
3
4
5
6
#分别测试
DT_score = testModel(treeModel, X_train, Y_train, X_test, Y_test)
RF_score = testModel(forestModel, X_train, Y_train, X_test, Y_test)
KNN_score = testModel(knnModel, X_train, Y_train, X_test, Y_test)
NB_score = testModel(BYSModel, X_train, Y_train, X_test, Y_test)
SVM_score = testModel(SVMModel, X_train, Y_train, X_test, Y_test)

接着照[8]进行Stacking,却怎么也调不通。
[9]介绍了一个库vecstack,封装了stacking的第一步,试试。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from vecstack import stacking
from sklearn.metrics import accuracy_score

# 进行stacking
# 用评分前五名的决策树,随机森林,KNN,朴素贝叶斯,支持向量机算法进行stacking。
models = [
DecisionTreeClassifier(),
RandomForestClassifier(),
KNeighborsClassifier(n_neighbors = 4),
GaussianNB(),
SVC()
]
Y_train.replace({"Sue":1.0, "Mark":2.0, "Kate":3.0, "Bob":4.0}, inplace = True)
Y_test.replace({"Sue":1.0, "Mark":2.0, "Kate":3.0, "Bob":4.0}, inplace = True)
x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.2, random_state=0)
S_train, S_test = stacking(models, x_train, y_train, x_test, regression=False, mode='oof_pred_bag', needs_proba=False, save_dir=None, metric=accuracy_score, n_folds=4, stratified=True, shuffle=True, random_state=0, verbose=2)
# 第二级,用梯度下降分类
model = SGDClassifier()
model = model.fit(S_train, y_train)
y_pred = model.predict(S_test)
print('Final prediction score: [%.8f]' % accuracy_score(y_test, y_pred))

#分别测试
DT_score = testModel(treeModel, X_train, Y_train, X_test, Y_test)
RF_score = testModel(forestModel, X_train, Y_train, X_test, Y_test)
KNN_score = testModel(knnModel, X_train, Y_train, X_test, Y_test)
NB_score = testModel(BYSModel, X_train, Y_train, X_test, Y_test)
SVM_score = testModel(SVMModel, X_train, Y_train, X_test, Y_test)
stacking_score = testModel(model, S_train, y_train, S_test, y_test)
print("Stacking结果")
stacking_results = pd.DataFrame({
'模型': ["决策树", "随机森林","KNN","朴素贝叶斯", "支持向量机", "Stacking"],
'预测正确率': [DT_score, RF_score, KNN_score, NB_score, SVM_score, stacking_score]})
print(stacking_results.sort_values(by='预测正确率', ascending=False))


换了一个二级模型,用逻辑回归模型

stacking的结果并不是最好的,跟选用的模型有很大关系。先熟悉流程吧。
本篇代码: https://github.com/zwdnet/MyQuant/tree/master/40

参考:
[1]https://zhuanlan.zhihu.com/p/32949396
[2]https://blog.csdn.net/wstcjf/article/details/77989963
[3] https://zhuanlan.zhihu.com/p/33589222
[4] https://zhuanlan.zhihu.com/p/32896968
[5] https://zhuanlan.zhihu.com/p/26890738
[6] https://zhuanlan.zhihu.com/p/75411512
[7] https://zhuanlan.zhihu.com/p/27493821
[8] https://www.cnblogs.com/yucaodie/p/7044737.html
[9] https://towardsdatascience.com/automate-stacking-in-python-fc3e7834772e

我发文章的三个地方,欢迎大家在朋友圈等地方分享,欢迎点“在看”。
我的个人博客地址:https://zwdnet.github.io
我的知乎文章地址: https://www.zhihu.com/people/zhao-you-min/posts
我的微信个人订阅号:赵瑜敏的口腔医学学习园地

欢迎打赏!感谢支持!