Quantitative Investing Study Notes 38 — Machine Learning in Practice

The last part of *Python Machine Learning Applications* covers reinforcement learning:
a program or agent interacts with its environment and learns a mapping from states to actions, with the goal of maximizing cumulative reward.
The concrete algorithms include Markov decision processes, Monte Carlo reinforcement learning, Q-learning, deep reinforcement learning, and Deep Q Networks.
The worked example is a game built with pygame, which won't run on my phone, so I skipped it.
From here on I'll learn from books and by working through problems.
Work resumed today, so there will be less time for study.
Let's take a project and walk through the whole machine learning process; the Titanic competition on Kaggle is a good one to practice on.
Code for this article: https://github.com/zwdnet/MyQuant/tree/master/38
I'll start by following a kernel by an experienced Kaggle author. [1]
That kernel focuses mainly on feature engineering.
I'll skip the background story.
1. Load the data.

import pandas as pd


# Concatenate the training and test data
def concat_df(train_data, test_data):
    return pd.concat([train_data, test_data], sort = True).reset_index(drop = True)


# Split the combined data back into training and test sets
def divide_df(all_data):
    return all_data.loc[:890], all_data.loc[891:].drop(["Survived"], axis = 1)


if __name__ == "__main__":
    # Load the data
    df_train = pd.read_csv("./data/train.csv")
    df_test = pd.read_csv("./data/test.csv")
    df_all = concat_df(df_train, df_test)

    df_train.name = "Training Set"
    df_test.name = "Test Set"
    df_all.name = "All Set"

    dfs = [df_train, df_test]

    print("Number of training samples = {}".format(df_train.shape[0]))
    print("Number of test samples = {}".format(df_test.shape[0]))
    print("Training X shape = {}".format(df_train.shape))
    print("Training y shape = {}".format(df_train["Survived"].shape[0]))
    print("Test X shape = {}".format(df_test.shape))
    print("Test y shape = {}".format(df_test.shape[0]))
    print(df_train.columns)
    print(df_test.columns)


The training and test sets are merged into a single DataFrame.
2. Exploratory data analysis
Overview
The columns are:
PassengerId: passenger ID, unique per row; has no effect on the outcome.
Survived: the target we want to predict; 1 means survived, 0 means died.
Pclass: ticket class, with three levels; 1 is upper, 2 is middle, 3 is lower.
Name, Sex and Age are self-explanatory.
SibSp: total number of the passenger's siblings and spouses aboard.
Parch: total number of the passenger's parents and children aboard.
Ticket: the passenger's ticket number.
Fare: the ticket fare.
Cabin: the cabin number.
Embarked: port of embarkation; one of C, Q or S.
Let's take a look at the data.

# Inspect the data
print(df_train.info())
print(df_train.sample(3))

print(df_test.info())
print(df_test.sample(3))


Age, Fare, Embarked and Cabin have missing values.
Handling missing values
When filling missing values it is convenient to keep the training and test data combined; otherwise the filled values may be biased towards one of the two sets. Age, Embarked and Fare have only a small fraction of missing values, whereas Cabin is missing for about 80% of passengers. The first three can be filled with descriptive statistics; Cabin cannot.
Filling missing ages.
Use the median age — not the median over all passengers, but the median within each passenger class, since Pclass correlates strongly with both Age and Survived.

# Correlation of Age with the other features
df_all_corr = df_all.corr().abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_all_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
print(df_all_corr[df_all_corr["Feature 1"] == "Age"])


To be more precise, Sex is used as a second grouping level when filling the missing ages. The median age rises with passenger class for both sexes, and the median age of women is slightly lower than that of men.

# Median age grouped by passenger class and sex
age_by_pclass_sex = df_all.groupby(["Sex", "Pclass"]).median()["Age"]

for pclass in range(1, 4):
    for sex in ["female", "male"]:
        print("Median age of Pclass {} {}s: {}".format(pclass, sex, age_by_pclass_sex[sex][pclass]))
print("Median age of all passengers: {}".format(df_all["Age"].median()))

# Fill the missing ages with each group's median age
df_all["Age"] = df_all.groupby(["Sex", "Pclass"])["Age"].apply(lambda x : x.fillna(x.median()))
print(df_all["Age"].isnull().sum())

Filling the missing embarkation port
There are only two missing values.

# Show the records with a missing Embarked value
print(df_all[df_all["Embarked"].isnull()])


Both are women with the same class and ticket number, so they knew each other. The typical embarkation port for women of the same class is C, but that does not mean it applies to them. Googling their names shows that they were mistress and maid and boarded at S, so the missing values are filled with S.

# Fill Embarked with S, the real value found by searching
df_all["Embarked"] = df_all["Embarked"].fillna('S')
Filling the missing fare
# Handle the missing Fare value
# Show the record with the missing value
print(df_all[df_all["Fare"].isnull()])


There is only one missing fare; it can be filled with the median fare of male passengers of the same class and family size. The fill itself is not shown in the excerpt above; a sketch follows.
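A minimal sketch of that fill, assuming the grouping is by Pclass, Parch and SibSp (class and family size) — my reconstruction, not necessarily the kernel's exact code:

# Sketch (assumed): fill the single missing Fare with the median fare of
# passengers sharing the same class and family size
df_all["Fare"] = df_all["Fare"].fillna(df_all.groupby(["Pclass", "Parch", "SibSp"])["Fare"].transform("median"))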
Finally, the missing cabin data.
Cabin is trickier: most of it is missing, yet it is related to survival, so it cannot simply be thrown away.
The first letter of Cabin indicates which deck the cabin is on. Most decks were used by a single passenger class, but a few were shared by several classes.
As shown below.

On the boat deck there are 6 rooms labelled T, U, W, X, Y and Z, but only T appears in the dataset.
Decks A, B and C held only first-class passengers.
Decks D and E held passengers of every class.
Decks F and G held second- and third-class passengers.
From deck A down to deck G the distance to the staircases increases, which may be a factor in survival.
Below we plot how passengers are distributed across decks, with M standing for missing. The plotting code uses a Deck feature and a grouped count (df_all_decks) that are not listed here; a sketch of how they are presumably built comes first.
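A sketch of those two preparatory steps, under the assumption that Deck is the first letter of Cabin, with 'M' for missing (my reconstruction):

# Sketch (assumed): Deck is the first letter of Cabin, 'M' when Cabin is missing
df_all["Deck"] = df_all["Cabin"].apply(lambda s: s[0] if pd.notnull(s) else "M")

# Count passengers per (Deck, Pclass), keep a single count column and transpose
# so that get_pclass_dist below can index it as df[deck][pclass]
df_all_decks = df_all.groupby(["Deck", "Pclass"]).count().drop(
    columns=["Survived", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Cabin", "PassengerId", "Ticket"]
).rename(columns={"Name": "Count"}).transpose()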

# Count passengers of each class on each deck
def get_pclass_dist(df):
    deck_counts = {'A': {}, 'B': {}, 'C': {}, 'D': {}, 'E': {}, 'F': {}, 'G': {}, 'M': {}, 'T': {}}
    decks = df.columns.levels[0]

    for deck in decks:
        for pclass in range(1, 4):
            try:
                count = df[deck][pclass][0]
                deck_counts[deck][pclass] = count
            except KeyError:
                deck_counts[deck][pclass] = 0
    df_decks = pd.DataFrame(deck_counts)
    deck_percentages = {}

    # Percentage of each passenger class on each deck
    for col in df_decks.columns:
        deck_percentages[col] = [(count/df_decks[col].sum()) * 100 for count in df_decks[col]]

    return deck_counts, deck_percentages


# Plot the class distribution on each deck
def display_pclass_dist(percentages):
    df_percentages = pd.DataFrame(percentages).transpose()
    deck_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'M', 'T')
    bar_count = np.arange(len(deck_names))
    bar_width = 0.85

    pclass1 = df_percentages[0]
    pclass2 = df_percentages[1]
    pclass3 = df_percentages[2]

    plt.figure(figsize = (20, 10))
    plt.bar(bar_count, pclass1, color = '#b5ffb9', edgecolor = "white", width = bar_width, label = "Passenger Class 1")
    plt.bar(bar_count, pclass2, bottom = pclass1, color = "#f9bc86", edgecolor = "white", width = bar_width, label = "Passenger Class 2")
    plt.bar(bar_count, pclass3, bottom = pclass1 + pclass2, color = "#a3acff", edgecolor = "white", width = bar_width, label = "Passenger Class 3")

    plt.xlabel("Deck", size = 15, labelpad = 20)
    plt.ylabel("Passenger Class Percentage", size = 15, labelpad = 20)
    plt.xticks(bar_count, deck_names)
    plt.tick_params(axis = "x", labelsize = 15)
    plt.tick_params(axis = "y", labelsize = 15)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), prop={'size': 15})
    plt.title("Passenger Class Distribution in Decks", size=18, y=1.05)
    plt.savefig("pclassdeck.png")

# Plot the passenger class proportions for each deck
all_deck_count, all_deck_per = get_pclass_dist(df_all_decks)
display_pclass_dist(all_deck_per)


Moving from deck A towards deck G, the share of lower-class passengers grows.
All passengers on decks A, B and C are first class. Deck D is 87% first class and 13% second class; deck E is 83% first, 10% second and 7% third class; deck F is 62% second and 38% third class; deck G is entirely third class. One passenger is on deck T, which is close to deck A; he is a first-class passenger, so he is grouped into deck A (that reassignment is not in the listing above; a possible one-liner is sketched below). Missing cabin values are labelled "M"; since these passengers' real cabins are unlikely to be recoverable, M is kept as a deck of its own.
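A sketch of that reassignment (assumed, my reconstruction):

# Sketch: move the single T-deck passenger, who is first class, into deck A
idx = df_all[df_all["Deck"] == "T"].index
df_all.loc[idx, "Deck"] = "A"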

# Survival counts for each deck
def get_survived_dist(df):
    surv_counts = {'A':{}, 'B':{}, 'C':{}, 'D':{}, 'E':{}, 'F':{}, 'G':{}, 'M':{}}
    decks = df.columns.levels[0]

    for deck in decks:
        for survive in range(0, 2):
            surv_counts[deck][survive] = df[deck][survive][0]

    df_surv = pd.DataFrame(surv_counts)
    surv_percentages = {}

    for col in df_surv.columns:
        surv_percentages[col] = [(count / df_surv[col].sum()) * 100 for count in df_surv[col]]

    return surv_counts, surv_percentages


# Plot the survival percentage for each deck
def display_surv_dist(percentages):
    df_survived_percentages = pd.DataFrame(percentages).transpose()
    deck_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'M')
    bar_count = np.arange(len(deck_names))
    bar_width = 0.85

    not_survived = df_survived_percentages[0]
    survived = df_survived_percentages[1]

    plt.figure(figsize=(20, 10))
    plt.bar(bar_count, not_survived, color='#b5ffb9', edgecolor='white', width=bar_width, label="Not Survived")
    plt.bar(bar_count, survived, bottom=not_survived, color='#f9bc86', edgecolor='white', width=bar_width, label="Survived")
    plt.xlabel('Deck', size=15, labelpad=20)
    plt.ylabel('Survival Percentage', size=15, labelpad=20)
    plt.xticks(bar_count, deck_names)
    plt.tick_params(axis='x', labelsize=15)
    plt.tick_params(axis='y', labelsize=15)

    plt.legend(loc='upper left', bbox_to_anchor=(1, 1), prop={'size': 15})
    plt.title('Survival Percentage in Decks', size=18, y=1.05)

    plt.savefig("CabinSurvived.png")

# Compute and plot the survival percentage for each deck
df_all_decks_survived = df_all.groupby(['Deck', 'Survived']).count().drop(columns=['Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Pclass', 'Cabin', 'PassengerId', 'Ticket']).rename(columns={'Name':'Count'}).transpose()
all_surv_count, all_surv_per = get_survived_dist(df_all_decks_survived)
display_surv_dist(all_surv_per)


Survival rates differ by deck: B, C, D and E have the highest rates, and their passengers are mostly first class; deck M has the lowest rate, and its passengers are mostly second and third class.
The decks can therefore be grouped as ABC (first-class passengers only), DE and FG (similar passenger mixes), and M.

# Group the decks by their passenger composition
df_all["Deck"] = df_all["Deck"].replace(['A', 'B', 'C'], 'ABC')
df_all["Deck"] = df_all["Deck"].replace(['D', 'E'], 'DE')
df_all["Deck"] = df_all["Deck"].replace(['F', 'G'], 'FG')
print(df_all["Deck"].value_counts())


Now that the Deck feature replaces Cabin, Cabin can be dropped.
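A one-line sketch of the drop (assumed; the actual line is not in the excerpt):

# Drop Cabin now that Deck replaces it
df_all.drop(["Cabin"], axis=1, inplace=True)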

# Split back into training and test sets
df_train, df_test = divide_df(df_all)
dfs = [df_train, df_test]

for df in dfs:
    print(df.info())


Done — no missing values remain.
Next, look at the distribution of the target: in the training set the survival rate is 38.38% (342/891) and the death rate is 61.62% (549/891).
Let's plot it.

# Distribution of the target
survived = df_train["Survived"].value_counts()[1]
not_survived = df_train["Survived"].value_counts()[0]
survived_per = survived / df_train.shape[0] * 100
not_survived_per = not_survived / df_train.shape[0] * 100
print('Out of {} passengers, {} survived, {:.2f}% of the training set.'.format(df_train.shape[0], survived, survived_per))
print('Out of {} passengers, {} died, {:.2f}% of the training set.'.format(df_train.shape[0], not_survived, not_survived_per))

plt.figure(figsize=(10, 8))
sns.countplot(df_train["Survived"])
plt.xlabel("Survival", size = 15, labelpad = 15)
plt.ylabel('Passenger Count', size=15, labelpad=15)
plt.xticks((0, 1), ['Not Survived ({0:.2f}%)'.format(not_survived_per), 'Survived ({0:.2f}%)'.format(survived_per)])
plt.tick_params(axis='x', labelsize=13)
plt.tick_params(axis='y', labelsize=13)

plt.title("Training Set Survival Distribution")
plt.savefig("surviveddist.png")


Analyzing the correlations between features.

# Correlations between features
df_train_corr = df_train.drop(["PassengerId"], axis = 1).corr().abs().unstack().sort_values(kind = "quicksort", ascending = False).reset_index()
df_train_corr.rename(columns = {"level_0": "Feature 1", "level_1": "Feature 2", 0: "Correlation Coefficient"}, inplace = True)
df_train_corr.drop(df_train_corr.iloc[1::2].index, inplace = True)
df_train_corr_nd = df_train_corr.drop(df_train_corr[df_train_corr["Correlation Coefficient"] == 1.0].index)

df_test_corr = df_test.drop(["PassengerId"], axis = 1).corr().abs().unstack().sort_values(kind = "quicksort", ascending = False).reset_index()
df_test_corr.rename(columns = {"level_0": "Feature 1", "level_1": "Feature 2", 0: "Correlation Coefficient"}, inplace = True)
df_test_corr.drop(df_test_corr.iloc[1::2].index, inplace = True)
df_test_corr_nd = df_test_corr.drop(df_test_corr[df_test_corr["Correlation Coefficient"] == 1.0].index)

# Strong correlations in the training set
corr = df_train_corr_nd["Correlation Coefficient"] > 0.1
print(df_train_corr_nd[corr])
# Strong correlations in the test set
corr = df_test_corr_nd["Correlation Coefficient"] > 0.1
print(df_test_corr_nd[corr])


In both the training and test sets, the strongest correlation is between Fare and Pclass.
Let's plot it.

# Plot the correlation heatmaps
fig = plt.figure(figsize = (20, 20))
sns.heatmap(df_train.drop(["PassengerId"], axis = 1).corr(), annot = True, square = True, cmap = "coolwarm", annot_kws = {"size" : 14})
plt.tick_params(axis = "x", labelsize = 14)
plt.title("Training Set Correlations", size = 15)
plt.savefig("TrainFeatureCorr.png")

fig = plt.figure(figsize = (20, 20))
sns.heatmap(df_test.drop(["PassengerId"], axis = 1).corr(), annot = True, square = True, cmap = "coolwarm", annot_kws = {"size" : 14})
plt.tick_params(axis = "y", labelsize = 14)
plt.title("Testing Set Correlations", size = 15)
plt.savefig("TestFeatureCorr.png")



Distribution of the target across the features
Start with the two continuous features, Age and Fare. Both have good split points for building decision trees. One potential problem is that their distributions differ between the training and test sets, the latter being smoother.
Let's plot them.

# Distributions of the continuous features
cont_features = ["Age", "Fare"]
surv = df_train["Survived"] == 1
fig, axs = plt.subplots(ncols = 1, nrows = 6, figsize = (15, 15))
plt.subplots_adjust(right = 1.5)

# Survival distribution within each feature
sns.distplot(df_train[~surv]["Age"], label = "Not Survived", hist = True, color='#e74c3c', ax=axs[0])
axs[0].set_title("Age_Survived dist")
sns.distplot(df_train[surv]["Fare"], label='Survived', hist=True, color='#2ecc71', ax=axs[1])
axs[1].set_title("Fare_Survived dist")
# Feature distributions in the training and test sets
sns.distplot(df_train["Age"], label='Training Set', hist=False, color='#e74c3c', ax=axs[2])
axs[2].set_title("TrainSetAge_Survived dist")
sns.distplot(df_test["Age"], label='Test Set', hist=False, color='#2ecc71', ax=axs[3])
axs[3].set_title("TestSetAge_Survived dist")

sns.distplot(df_train["Fare"], label='Training Set', hist=False, color='#e74c3c', ax=axs[4])
axs[4].set_title("TrainSetFare_Survived dist")
sns.distplot(df_test["Fare"], label='Test Set', hist=False, color='#2ecc71', ax=axs[5])
axs[5].set_title("TestSetFare_Survived dist")

plt.savefig("feature_dist.png")


The survival distribution over Age shows that passengers under 15 had a higher survival rate than the other groups; the distribution over Fare shows higher survival at the upper end of the fares.
Now for the categorical features.
Every categorical feature has at least one category with a clearly higher survival rate, which is very helpful for prediction. Pclass and Sex are the best categorical features because of the homogeneity of their distributions.
Again, let's plot them.

# Categorical features
cat_features = ['Embarked', 'Parch', 'Pclass', 'Sex', 'SibSp', 'Deck']

fig, axs = plt.subplots(ncols=2, nrows=3, figsize=(20, 20))
plt.subplots_adjust(right=1.5, top=1.25)

for i, feature in enumerate(cat_features, 1):
    plt.subplot(2, 3, i)
    sns.countplot(x=feature, hue='Survived', data=df_train)

    plt.xlabel('{}'.format(feature), size=20, labelpad=15)
    plt.ylabel('Passenger Count', size=20, labelpad=15)
    plt.tick_params(axis='x', labelsize=20)
    plt.tick_params(axis='y', labelsize=20)

    plt.legend(['Not Survived', 'Survived'], loc='upper center', prop={'size': 18})
    plt.title('Count of Survival in {} Feature'.format(feature), size=20, y=1.05)

plt.savefig("cat_feature_dist.png")


Passengers who embarked at S have the lowest survival rate, while more than half of those who embarked at C survived. Passengers with one family member aboard have the highest survival rate.
Conclusions from the exploration: the features are strongly correlated, so new features can be derived by transforming them; the continuous features can be split along good decision-tree cut points; survival rates differ greatly across the categories of the categorical features, which can be one-hot encoded; and some features can be combined into new ones. During exploration a new feature, "Deck", was created and the "Cabin" feature was dropped.
Merge the data back into one DataFrame.

# Merge the data back together
df_all = concat_df(df_train, df_test)
print(df_all.head())


That completes data cleaning and exploration; next comes feature engineering.
3. Feature engineering
Continuous features
Fare
Bin Fare into 13 groups and plot.

# Bin Fare into 13 quantile groups and plot
df_all["Fare"] = pd.qcut(df_all["Fare"], 13)
# Plot
fig, axs = plt.subplots(figsize=(22, 9))
sns.countplot(x = "Fare", hue = "Survived", data = df_all)
plt.xlabel('Fare', size=15, labelpad=20)
plt.ylabel('Passenger Count', size=15, labelpad=20)
plt.tick_params(axis='x', labelsize=10)
plt.tick_params(axis='y', labelsize=15)

plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 15})
plt.title('Count of Survival in {} Feature'.format('Fare'), size=15, y=1.05)

plt.savefig("FE_fare.png")


Survival is lowest in the leftmost bins and highest in the rightmost ones.
One bin in the middle, (15.742, 23.25], is an outlier.
Age: Age is roughly normally distributed. It is binned into ten groups; the first group has the highest survival rate and the fourth the lowest, while the (34.0, 40.0] bin is an outlier with an unusually high rate.

# Bin Age into 10 quantile groups and plot
df_all["Age"] = pd.qcut(df_all["Age"], 10)
# Plot
fig, axs = plt.subplots(figsize=(22, 9))
sns.countplot(x = "Age", hue = "Survived", data = df_all)
plt.xlabel('Age', size=15, labelpad=20)
plt.ylabel('Passenger Count', size=15, labelpad=20)
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)

plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 15})
plt.title('Count of Survival in {} Feature'.format('Age'), size=15, y=1.05)

plt.savefig("FE_age.png")


Frequency encoding
Create a Family_Size feature as SibSp + Parch + 1. SibSp is the number of siblings and spouses and Parch the number of parents and children, so their sum is the number of relatives aboard; adding 1 counts the passenger themselves. A value of 1 is labelled "Alone", 2-4 "Small", 5-6 "Medium", and 7, 8 and 11 "Large".

# Create the Family_Size feature and plot it
df_all["Family_Size"] = df_all["SibSp"] + df_all["Parch"] + 1
fig, axs = plt.subplots(figsize=(10, 10), ncols=2, nrows=2)
#plt.subplots_adjust(right = 1.5)

sns.barplot(x=df_all['Family_Size'].value_counts().index, y=df_all['Family_Size'].value_counts().values, ax=axs[0][0])
sns.countplot(x='Family_Size', hue='Survived', data=df_all, ax=axs[0][1])

axs[0][0].set_title('Family Size Feature Value Counts', size=10, y=1.05)
axs[0][1].set_title('Survival Counts in Family Size ', size=10, y=1.05)

family_map = {1: 'Alone', 2: 'Small', 3: 'Small', 4: 'Small', 5: 'Medium', 6: 'Medium', 7: 'Large', 8: 'Large', 11: 'Large'}
df_all['Family_Size_Grouped'] = df_all['Family_Size'].map(family_map)

sns.barplot(x=df_all['Family_Size_Grouped'].value_counts().index, y=df_all['Family_Size_Grouped'].value_counts().values, ax=axs[1][0])
sns.countplot(x='Family_Size_Grouped', hue='Survived', data=df_all, ax=axs[1][1])

axs[1][0].set_title('Family Size Feature Value Counts After Grouping', size=10, y=1.05)
axs[1][1].set_title('Survival Counts in Family Size After Grouping', size=10, y=1.05)

for i in range(2):
    axs[i][1].legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 10})
    for j in range(2):
        axs[i][j].tick_params(axis='x', labelsize=10)
        axs[i][j].tick_params(axis='y', labelsize=10)
        axs[i][j].set_xlabel('')
        axs[i][j].set_ylabel('')
plt.savefig("FE_family.png")


Ticket has many distinct values that would need individual analysis, so grouping tickets by how often each one occurs keeps things simple.
Many people travelled in groups — friends, masters and servants, and so on — who are not counted as a family but share the same ticket.
Group passengers by how many people hold the same ticket and plot.

# Ticket_Frequency feature and plot
df_all["Ticket_Frequency"] = df_all.groupby("Ticket")["Ticket"].transform("count")
fig, axs = plt.subplots(figsize = (12, 9))
sns.countplot(x = "Ticket_Frequency", hue = "Survived", data = df_all)
plt.xlabel('Ticket Frequency', size=15, labelpad=20)
plt.ylabel('Passenger Count', size=15, labelpad=20)
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)

plt.legend(['Not Survived', 'Survived'], loc='upper right', prop={'size': 15})
plt.title('Count of Survival in {} Feature'.format('Ticket Frequency'), size=15, y=1.05)

plt.savefig("FE_ticket.png")


The plot shows that survival is highest when 2, 3 or 4 people share a ticket, and drops sharply above 4.
Title and marital status
Title is a new feature derived from the prefix in Name. There are many prefixes: Miss, Mrs, Ms, Mlle, Lady, Mme, the Countess and Dona are replaced with Miss/Mrs/Ms, while Dr, Col, Major, Jonkheer, Capt, Sir, Don and Rev are replaced with Dr/Military/Noble/Clergy.
Is_Married is a binary feature based on the Mrs prefix, which has the highest survival rate of all the female titles.

# Create Title and Is_Married from the name prefix and analyse them
df_all["Title"] = df_all["Name"].str.split(', ', expand = True)[1].str.split('.', expand = True)[0]
df_all["Is_Married"] = 0
df_all["Is_Married"].loc[df_all["Title"] == "Mrs"] = 1

fig, axs = plt.subplots(nrows=2, figsize=(20, 20))
sns.barplot(x=df_all['Title'].value_counts().index, y=df_all['Title'].value_counts().values, ax=axs[0])

axs[0].tick_params(axis='x', labelsize=10)
axs[1].tick_params(axis='x', labelsize=15)

for i in range(2):
    axs[i].tick_params(axis='y', labelsize=15)

axs[0].set_title('Title Feature Value Counts', size=20, y=1.05)

df_all['Title'] = df_all['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')
df_all['Title'] = df_all['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Dr/Military/Noble/Clergy')

sns.barplot(x=df_all['Title'].value_counts().index, y=df_all['Title'].value_counts().values, ax=axs[1])
axs[1].set_title('Title Feature Value Counts After Grouping', size=20, y=1.05)

plt.savefig("FE_title.png")
plt.close()


extract_surname pulls the surname out of Name. A Family feature is created from the surname, and passengers are then grouped by family.

# Extract the surname from a name
def extract_surname(data):
    families = []

    for i in range(len(data)):
        name = data.iloc[i]

        if '(' in name:
            name_no_bracket = name.split('(')[0]
        else:
            name_no_bracket = name

        family = name_no_bracket.split(',')[0]
        title = name_no_bracket.split(',')[1].strip().split(' ')[0]

        # Strip punctuation from the surname
        for c in string.punctuation:
            family = family.replace(c, '').strip()

        families.append(family)

    return families

# Build the Family feature from Name
df_all["Family"] = extract_surname(df_all["Name"])
df_train, df_test = tools.divide_df(df_all)
dfs = [df_train, df_test]

Family_Survival_Rate is computed from the Family feature of the training set, since the test set has no survival labels. A list of family names that appear in both the training and test sets is built, the survival rate of each such family is computed, and it is stored in the Family_Survival_Rate feature.
A Family_Survival_Rate_NA feature marks families that appear only in the test set, whose family survival rate cannot be computed.
Ticket_Survival_Rate and Ticket_Survival_Rate_NA are computed the same way. Survival_Rate is the average of Ticket_Survival_Rate and Family_Survival_Rate, and Survival_Rate_NA is the average of Ticket_Survival_Rate_NA and Family_Survival_Rate_NA.
See the code. The listing below starts from the filtering loop; the preparatory variables it relies on are sketched first.
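The listing uses df_family_survival_rate, df_ticket_survival_rate, non_unique_families, non_unique_tickets, family_rates and ticket_rates without showing how they are built. A sketch of that preparatory step under the assumptions described above (my reconstruction, not the kernel's exact code):

# Sketch (assumed): families / tickets that appear in both the training and test sets
non_unique_families = [f for f in df_train["Family"].unique() if f in df_test["Family"].unique()]
non_unique_tickets = [t for t in df_train["Ticket"].unique() if t in df_test["Ticket"].unique()]

# Median survival and group size per family / ticket in the training set;
# column 0 is the survival rate, column 1 the group size used by the filter below
df_family_survival_rate = df_train.groupby("Family")[["Survived", "Family_Size"]].median()
df_ticket_survival_rate = df_train.groupby("Ticket")[["Survived", "Ticket_Frequency"]].median()

family_rates = {}
ticket_rates = {}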

# Keep families/tickets that occur in both sets and have more than one member
for i in range(len(df_family_survival_rate)):
    if df_family_survival_rate.index[i] in non_unique_families and df_family_survival_rate.iloc[i, 1] > 1:
        family_rates[df_family_survival_rate.index[i]] = df_family_survival_rate.iloc[i, 0]

for i in range(len(df_ticket_survival_rate)):
    if df_ticket_survival_rate.index[i] in non_unique_tickets and df_ticket_survival_rate.iloc[i, 1] > 1:
        ticket_rates[df_ticket_survival_rate.index[i]] = df_ticket_survival_rate.iloc[i, 0]

mean_survival_rate = np.mean(df_train["Survived"])

train_family_survival_rate = []
train_family_survival_rate_NA = []
test_family_survival_rate = []
test_family_survival_rate_NA = []

for i in range(len(df_train)):
    if df_train["Family"][i] in family_rates:
        train_family_survival_rate.append(family_rates[df_train["Family"][i]])
        train_family_survival_rate_NA.append(1)
    else:
        train_family_survival_rate.append(mean_survival_rate)
        train_family_survival_rate_NA.append(0)

for i in range(len(df_test)):
    if df_test["Family"].iloc[i] in family_rates:
        test_family_survival_rate.append(family_rates[df_test["Family"].iloc[i]])
        test_family_survival_rate_NA.append(1)
    else:
        test_family_survival_rate.append(mean_survival_rate)
        test_family_survival_rate_NA.append(0)

df_train["Family_Survival_Rate"] = train_family_survival_rate
df_train["Family_Survival_Rate_NA"] = train_family_survival_rate_NA
df_test["Family_Survival_Rate"] = test_family_survival_rate
df_test["Family_Survival_Rate_NA"] = test_family_survival_rate_NA

train_ticket_survival_rate = []
train_ticket_survival_rate_NA = []
test_ticket_survival_rate = []
test_ticket_survival_rate_NA = []

for i in range(len(df_train)):
    if df_train["Ticket"][i] in ticket_rates:
        train_ticket_survival_rate.append(ticket_rates[df_train["Ticket"][i]])
        train_ticket_survival_rate_NA.append(1)
    else:
        train_ticket_survival_rate.append(mean_survival_rate)
        train_ticket_survival_rate_NA.append(0)

for i in range(len(df_test)):
    if df_test["Ticket"].iloc[i] in ticket_rates:
        test_ticket_survival_rate.append(ticket_rates[df_test["Ticket"].iloc[i]])
        test_ticket_survival_rate_NA.append(1)
    else:
        test_ticket_survival_rate.append(mean_survival_rate)
        test_ticket_survival_rate_NA.append(0)

df_train["Ticket_Survival_Rate"] = train_ticket_survival_rate
df_train["Ticket_Survival_Rate_NA"] = train_ticket_survival_rate_NA
df_test["Ticket_Survival_Rate"] = test_ticket_survival_rate
df_test["Ticket_Survival_Rate_NA"] = test_ticket_survival_rate_NA

for df in [df_train, df_test]:
    df["Survival_Rate"] = (df["Ticket_Survival_Rate"] + df["Family_Survival_Rate"]) / 2
    df["Survival_Rate_NA"] = (df["Ticket_Survival_Rate_NA"] + df["Family_Survival_Rate_NA"]) / 2

The final step of feature engineering is feature transformation.
First, label-encode the non-numeric features — mainly Embarked, Sex, Deck, Title and Family_Size_Grouped; the binned Age and Fare are now categorical too. LabelEncoder converts them to numeric codes.

# Convert the categorical features to numeric codes with LabelEncoder
non_numeric_features = ["Embarked", "Sex", "Deck", "Title", "Family_Size_Grouped", "Age", "Fare"]
for df in dfs:
    for feature in non_numeric_features:
        df[feature] = LabelEncoder().fit_transform(df[feature])

Finally, one-hot encode the categorical features.
OneHotEncoder is used because simply assigning 1, 2, 3, ... would make classifiers treat the values as continuous and ordered.
The code:

# One-hot encode the categorical features
cat_features = ["Pclass", "Sex", "Deck", "Embarked", "Title", "Family_Size_Grouped"]
encoded_features = []

for df in dfs:
    for feature in cat_features:
        encoded_feat = OneHotEncoder().fit_transform(df[feature].values.reshape(-1, 1)).toarray()
        n = df[feature].nunique()
        cols = ["{}_{}".format(feature, n) for n in range(1, n + 1)]
        encoded_df = pd.DataFrame(encoded_feat, columns = cols)
        encoded_df.index = df.index
        encoded_features.append(encoded_df)

df_train = pd.concat([df_train, *encoded_features[:6]], axis = 1)
df_test = pd.concat([df_test, *encoded_features[6:]], axis = 1)

Feature engineering recap: Age and Fare were binned; Family_Size was built from Parch and SibSp; Ticket_Frequency counts how often each ticket appears; the Name feature proved very useful and yielded several derived features; and finally the categorical features were encoded.
Now combine the data and print it.

# Combine the data, keeping only the useful features
df_all = tools.concat_df(df_train, df_test)
drop_cols = ['Deck', 'Embarked', 'Family', 'Family_Size', 'Family_Size_Grouped', 'Survived', 'Name', 'Parch', 'PassengerId', 'Pclass', 'Sex', 'SibSp', 'Ticket', 'Title', 'Ticket_Survival_Rate','Family_Survival_Rate', 'Ticket_Survival_Rate_NA', 'Family_Survival_Rate_NA']
df_all.drop(columns = drop_cols, inplace = True)
print(df_all.head())
print(df_all.info())


4. Modelling
Finally, the most exciting part: building the model.
First, prepare the data.

# Split the data
df_train, df_test = tools.divide_df(df_all)
X_train = StandardScaler().fit_transform(df_train.drop(["Survived"], axis = 1))
y_train = df_train["Survived"].values
X_test = StandardScaler().fit_transform(df_test)

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

Result:
X_train shape: (891, 26)
y_train shape: (891,)
X_test shape: (418, 26)
A random forest classifier is used. Two models are built: a single model and one trained with stratified k-fold cross-validation.

# ④ Modelling
# Split the data
df_train, df_test = tools.divide_df(df_all)
X_train = StandardScaler().fit_transform(df_train.drop(["Survived"], axis = 1))
y_train = df_train["Survived"].values
X_test = StandardScaler().fit_transform(df_test)

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))

single_best_model = RFC(criterion = "gini", n_estimators = 1100, max_depth = 5, min_samples_split=4, min_samples_leaf=5, max_features='auto', oob_score=True, random_state=SEED, n_jobs=-1, verbose=1)
leaderboard_model = RFC(criterion = "gini", n_estimators = 1750, max_depth = 7, min_samples_split=6, min_samples_leaf=6, max_features='auto', oob_score=True, random_state=SEED, n_jobs=-1, verbose=1)
N = 5
oob = 0
probs = pd.DataFrame(np.zeros((len(X_test), N*2)), columns = ['Fold_{}_Prob_{}'.format(i, j) for i in range(1, N + 1) for j in range(2)])
df_temp = df_all.drop(["Survived"], axis = 1)
importances = pd.DataFrame(np.zeros((X_train.shape[1], N)), columns=['Fold_{}'.format(i) for i in range(1, N + 1)], index=df_temp.columns)
fprs, tprs, scores = [], [], []
skf = StratifiedKFold(n_splits=N, random_state=N, shuffle=True)

for fold, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train), 1):
    print('Fold {}\n'.format(fold))

    # Fit the model on the training folds
    leaderboard_model.fit(X_train[trn_idx], y_train[trn_idx])

    # Training AUC
    trn_fpr, trn_tpr, trn_thresholds = roc_curve(y_train[trn_idx], leaderboard_model.predict_proba(X_train[trn_idx])[:, 1])
    trn_auc_score = auc(trn_fpr, trn_tpr)
    # Validation AUC
    val_fpr, val_tpr, val_thresholds = roc_curve(y_train[val_idx], leaderboard_model.predict_proba(X_train[val_idx])[:, 1])
    val_auc_score = auc(val_fpr, val_tpr)

    scores.append((trn_auc_score, val_auc_score))
    fprs.append(val_fpr)
    tprs.append(val_tpr)

    # Probabilities on X_test
    probs.loc[:, 'Fold_{}_Prob_0'.format(fold)] = leaderboard_model.predict_proba(X_test)[:, 0]
    probs.loc[:, 'Fold_{}_Prob_1'.format(fold)] = leaderboard_model.predict_proba(X_test)[:, 1]
    importances.iloc[:, fold - 1] = leaderboard_model.feature_importances_

    oob += leaderboard_model.oob_score_ / N
    print('Fold {} OOB Score: {}\n'.format(fold, leaderboard_model.oob_score_))

print('Average OOB Score: {}'.format(oob))


The average OOB score comes out around 0.84. To be honest I didn't fully understand this part and copied it from the original.
Let's plot some results, starting with the feature importances in the model.

# Plot the feature importances
importances['Mean_Importance'] = importances.mean(axis=1)
importances.sort_values(by='Mean_Importance', inplace=True, ascending=False)

plt.figure(figsize=(15, 20))
sns.barplot(x='Mean_Importance', y=importances.index, data=importances)

plt.xlabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.title('Random Forest Classifier Mean Feature Importance Between Folds', size=15)

plt.savefig("RandomForest.png")

# Plot ROC curves
def plot_roc_curve(fprs, tprs):

    tprs_interp = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    f, ax = plt.subplots(figsize=(15, 15))

    # Plot the ROC curve of each fold and compute its AUC
    for i, (fpr, tpr) in enumerate(zip(fprs, tprs), 1):
        tprs_interp.append(np.interp(mean_fpr, fpr, tpr))
        tprs_interp[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        ax.plot(fpr, tpr, lw=1, alpha=0.3, label='ROC Fold {} (AUC = {:.3f})'.format(i, roc_auc))

    # ROC of random guessing
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', alpha=0.8, label='Random Guessing')

    mean_tpr = np.mean(tprs_interp, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)

    # Mean ROC
    ax.plot(mean_fpr, mean_tpr, color='b', label='Mean ROC (AUC = {:.3f} $\pm$ {:.3f})'.format(mean_auc, std_auc), lw=2, alpha=0.8)

    # Standard deviation band around the mean ROC
    std_tpr = np.std(tprs_interp, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    ax.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2, label='$\pm$ 1 std. dev.')

    ax.set_xlabel('False Positive Rate', size=15, labelpad=20)
    ax.set_ylabel('True Positive Rate', size=15, labelpad=20)
    ax.tick_params(axis='x', labelsize=15)
    ax.tick_params(axis='y', labelsize=15)
    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])

    ax.set_title('ROC Curves of Folds', size=20, y=1.02)
    ax.legend(loc='lower right', prop={'size': 13})

    plt.savefig("ROC.png")

plot_roc_curve(fprs, tprs)


Finally, make predictions and submit to Kaggle.

# Predictions for the submission
class_survived = [col for col in probs.columns if col.endswith('Prob_1')]
probs['1'] = probs[class_survived].sum(axis=1) / N
probs['0'] = probs.drop(columns=class_survived).sum(axis=1) / N
probs['pred'] = 0
pos = probs[probs['1'] >= 0.5].index
probs.loc[pos, 'pred'] = 1

y_pred = probs['pred'].astype(int)

submission_df = pd.DataFrame(columns=['PassengerId', 'Survived'])
submission_df['PassengerId'] = df_test['PassengerId']
submission_df['Survived'] = y_pred.values
submission_df.to_csv('submissions.csv', header=True, index=False)
print(submission_df.head(10))

Submit to Kaggle and see.


0.80, ranked 1222 — better than anything I'd done before. Let's see how to improve it.
What I gained from walking through this kernel:
① Combine the training and test data before cleaning and feature engineering, so the two are processed consistently and no systematic gap opens up between them.
② Choose an imputation method that suits the data. For missing ages, the median of a relevant subgroup (e.g. by title or passenger class) is more accurate than the overall median; external information, such as real values found by googling, can also be used.
③ The more domain knowledge, the better the feature engineering — here, knowledge of the Titanic's deck layout was used to handle the missing Cabin data.
④ During analysis and feature engineering, new attributes can be built from existing ones to consolidate the information in the data and make its patterns clearer.
⑤ Categorical features can be converted with one-hot encoding, which avoids treating the categories as continuous, ordered values.
Next, let's make some improvements based on other articles.
First, following this article, write a function that plots a model's learning curve. [2]

# Plot the learning curve of a model
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5), verbose=0):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt

# ⑤ Model evaluation
rf_parameters = {"criterion":"gini", "n_estimators":1750, "max_depth":7, "min_samples_split":6, "min_samples_leaf":6, "max_features":'auto', "oob_score":True, "random_state":SEED, "n_jobs":-1, "verbose":1}
title = "RandomForest"
df_train, df_test = tools.divide_df(df_all)
X_train = StandardScaler().fit_transform(df_train.drop(["Survived"], axis = 1))
y_train = df_train["Survived"].values
plt = plot_learning_curve(RFC(**rf_parameters), title, X_train, y_train, cv=None, n_jobs=-1, train_sizes=[50, 100, 150, 200, 250, 350, 400, 450, 500])
plt.savefig("learningCurve.png")


Next, try different models.
According to this article [3], this is a supervised classification problem, and the candidate algorithms include:

from sklearn.linear_model import LogisticRegression    # logistic regression
from sklearn.svm import SVC, LinearSVC                 # support vector machines
from sklearn.ensemble import RandomForestClassifier    # random forest
from sklearn.neighbors import KNeighborsClassifier     # k-nearest neighbours
from sklearn.naive_bayes import GaussianNB              # naive Bayes
from sklearn.linear_model import Perceptron             # perceptron
from sklearn.linear_model import SGDClassifier          # stochastic gradient descent classifier
from sklearn.tree import DecisionTreeClassifier         # decision tree
from sklearn.model_selection import StratifiedKFold     # stratified k-fold cross-validation
from sklearn.model_selection import GridSearchCV        # grid search

import tools
import numpy as np
import pandas as pd

Let's try them one by one.

# Fit a model and score it on the training set
def ModelTest(Model, X_train, Y_train):
    Model.fit(X_train, Y_train)
    # Accuracy on the training data, as a percentage
    acc_result = round(Model.score(X_train, Y_train) * 100, 2)
    return acc_result


# Try each candidate model and compare their training scores
def model_compare(df_all):
    # Split the data
    train_df, test_df = tools.divide_df(df_all)
    X_train = train_df.drop(["Survived", "PassengerId"], axis = 1)
    Y_train = train_df["Survived"]
    X_test = test_df
    print(X_train.shape, Y_train.shape, X_test.shape)

    # Candidate models
    candidates = {
        'Logistic Regression': LogisticRegression(),
        'Support Vector Machines': SVC(),
        'KNN': KNeighborsClassifier(n_neighbors = 3),
        'Naive Bayes': GaussianNB(),
        'Perceptron': Perceptron(),
        'Linear SVC': LinearSVC(),
        'Stochastic Gradient Decent': SGDClassifier(),
        'Decision Tree': DecisionTreeClassifier(),
        'Random Forest': RandomForestClassifier(),
    }

    # Fit each model and record its training score
    scores = {name: ModelTest(model, X_train, Y_train) for name, model in candidates.items()}
    for name, score in scores.items():
        print("{}: {}".format(name, score))

    # Rank the models by training score
    models = pd.DataFrame({'Model': list(scores.keys()), 'Score': list(scores.values())})
    print(models.sort_values(by='Score', ascending=False))

    # Predict with the decision tree model and write the submission file
    tools.Submission(candidates['Decision Tree'], test_df, "decisetree.csv")


Of these, the decision tree has the highest training score; let's generate predictions with it and submit.
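tools.Submission is a small helper in my repo that is not listed here; a hypothetical sketch of what such a helper might look like (names and behaviour assumed, not the actual code):

# Hypothetical sketch of the Submission helper: predict on the test set with an
# already-fitted model and write a Kaggle submission CSV
def Submission(model, test_df, filename):
    X_test = test_df.drop(["PassengerId"], axis=1)
    pred = model.predict(X_test).astype(int)
    out = pd.DataFrame({"PassengerId": test_df["PassengerId"], "Survived": pred})
    out.to_csv(filename, index=False)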

It actually does worse than the random forest.
Let's also plot the learning curve of each algorithm.
A learning curve shows how the scores on the training and validation sets change as the training set grows: the x-axis is the number of samples and the y-axis is the score (e.g. accuracy) on the training and cross-validation sets. It helps diagnose whether a model is currently overfitting (high variance) or underfitting (high bias). [4] A sketch of how the per-model curves might be drawn follows.
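A sketch of drawing a learning curve for each candidate model, reusing plot_learning_curve above; the model list, cv value and file names here are assumptions for illustration, not the original code:

# Sketch (assumed): one learning curve per candidate model
models = {
    "LogisticRegression": LogisticRegression(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
}
for name, model in models.items():
    plt = plot_learning_curve(model, name, X_train, y_train, cv=5, n_jobs=-1)
    plt.savefig("learningCurve_{}.png".format(name))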
Next, let's look at how to cross-validate the algorithms. [5]
Cross-validation means training on one split (or fold) of the training data at a time: part of it is used for training and part for testing, repeated over several splits.
In sklearn, if you only want to split the data into a training and a test set, use train_test_split; if you already have a training and a test set and want to train on the former before applying the model to the latter, use cross-validation to choose the model and its parameters, specifically cross_val_score.

# Cross-validate a model and return its mean score
def cross_val(model, X, Y, cv=5):
    scores = cross_val_score(model, X, Y, cv=cv)
    score = scores.mean()
    return score
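A sketch of applying it to compare a few of the candidate models; it assumes the X_train / Y_train prepared in model_compare above and is only an illustrative usage, not the exact original code:

# Sketch (assumed usage): cross-validated comparison of some candidate models
cv_scores = {
    "Logistic Regression": cross_val(LogisticRegression(), X_train, Y_train),
    "Random Forest": cross_val(RandomForestClassifier(), X_train, Y_train),
    "Decision Tree": cross_val(DecisionTreeClassifier(), X_train, Y_train),
    "SVM": cross_val(SVC(), X_train, Y_train),
}
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {:.4f}".format(name, score))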


The cross-validation ranking differs from the training scores above, so let's submit a logistic regression prediction and see.

Compare it with my earlier logistic regression result:

It improved by 0.005, which amounts to one or two predictions. So a model with a higher score is not necessarily better, especially since different scoring methods rank the models quite differently.
Let's try model ensembling.
Following [6].
First select the most important features rather than including them all, to avoid overfitting.

# Select the most important features by combining the rankings of several tree-based models
def get_top_n_features(df_all, top_n_features):
    # Split the data
    train_df, test_df = tools.divide_df(df_all)
    titanic_train_data_X = train_df.drop(["Survived", "PassengerId"], axis = 1)
    titanic_train_data_Y = train_df["Survived"]

    # Candidate models and the parameter grids for the grid search
    candidates = [
        ("RF", RandomForestClassifier(random_state=0), {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}),
        ("Ada", AdaBoostClassifier(random_state=0), {'n_estimators': [500], 'learning_rate': [0.01, 0.1]}),
        ("ET", ExtraTreesClassifier(random_state=0), {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [20]}),
        ("GB", GradientBoostingClassifier(random_state=0), {'n_estimators': [500], 'learning_rate': [0.01, 0.1], 'max_depth': [20]}),
        ("DT", DecisionTreeClassifier(random_state=0), {'min_samples_split': [2, 4], 'max_depth': [20]}),
    ]

    top_feature_lists = []
    importance_frames = []
    for name, est, param_grid in candidates:
        # Grid search with 10-fold cross-validation
        grid = model_selection.GridSearchCV(est, param_grid, n_jobs=-1, cv=10, verbose=1)
        grid.fit(titanic_train_data_X, titanic_train_data_Y)
        print('Top N Features Best {} Params: {}'.format(name, grid.best_params_))
        print('Top N Features Best {} Score: {}'.format(name, grid.best_score_))
        print('Top N Features {} Train Score: {}'.format(name, grid.score(titanic_train_data_X, titanic_train_data_Y)))

        # Rank the features by importance for this model
        feature_imp_sorted = pd.DataFrame({'feature': list(titanic_train_data_X), 'importance': grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
        features_top_n = feature_imp_sorted.head(top_n_features)['feature']
        print('Sample 10 Features from {} Classifier:'.format(name))
        print(str(features_top_n[:10]))

        top_feature_lists.append(features_top_n)
        importance_frames.append(feature_imp_sorted)

    # Merge the feature rankings of the five models
    features_top_n = pd.concat(top_feature_lists, ignore_index=True).drop_duplicates()
    features_importance = pd.concat(importance_frames, ignore_index=True)

    return features_top_n, features_importance

Then train a model on the top 10 selected features, predict and submit; a possible version of this step is sketched below.
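A sketch of that step, assuming get_top_n_features above and that PassengerId is still available in the test split; the model, parameters and file name are assumptions, not my original code:

# Sketch (assumed): keep only the top 10 features, retrain, predict and submit
features_top_n, _ = get_top_n_features(df_all, 10)
train_df, test_df = tools.divide_df(df_all)
X_train = train_df[features_top_n]
Y_train = train_df["Survived"]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, Y_train)
pred = model.predict(test_df[features_top_n]).astype(int)
pd.DataFrame({"PassengerId": test_df["PassengerId"], "Survived": pred}).to_csv("top10.csv", index=False)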
The result...

Exactly the same as with all the features — no improvement. Finally, let's try model fusion, again following [6].
Common fusion methods include Bagging, Boosting, Stacking and Blending.
Bagging combines the predictions of several models, i.e. base learners, by simple weighted averaging or voting; its advantage is that the base learners can be trained in parallel. Random Forest uses the Bagging idea.
Boosting works like learning from one's mistakes: each base learner builds on the previous one and tries to correct its errors. AdaBoost and Gradient Boosting use this idea.
Stacking trains a new meta-learner to learn how to combine the base learners of the previous layer. If Bagging is a linear combination of base classifiers, Stacking is a non-linear one; learners can be stacked layer upon layer into a network-like structure.
Blending is similar to Stacking, but it also guards against information leakage.
Here a two-layer fusion is used. Level 1 consists of seven models: RandomForest, AdaBoost, ExtraTrees, GBDT, DecisionTree, KNN and SVM. Level 2 uses XGBoost, taking the level-1 predictions as features to make the final prediction.
If we train on the training data and then predict on that same training data, we get label leakage. To avoid it, each base learner is trained with K-fold cross-validation, and the K out-of-fold predictions on the validation folds are concatenated to form the input of the next layer's learner.

import numpy as np
import pandas as pd

from sklearn.model_selection import KFold

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

import tools


SEED = 0
NFOLDS = 7
kf = KFold(n_splits = NFOLDS, random_state = SEED, shuffle = False)


# Out-of-fold predictions: train on K-1 folds, predict the held-out fold and the test set
def get_out_fold(clf, x_train, y_train, x_test):
    ntrain = x_train.shape[0]
    ntest = x_test.shape[0]
    oof_train = np.zeros((ntrain, ))
    oof_test = np.zeros((ntest, ))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)


# Model fusion (stacking)
def MergeModels(df_all, top_features):
    print("Model fusion")
    # Level 1 base learners
    rf = RandomForestClassifier(n_estimators=500, warm_start=True, max_features='sqrt', max_depth=6, min_samples_split=3, min_samples_leaf=2, n_jobs=-1, verbose=0)
    ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)
    et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, max_depth=8, min_samples_leaf=2, verbose=0)
    gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.008, min_samples_split=3, min_samples_leaf=2, max_depth=5, verbose=0)
    dt = DecisionTreeClassifier(max_depth=8)
    knn = KNeighborsClassifier(n_neighbors = 2)
    svm = SVC(kernel='linear', C=0.025)

    train_df, test_df = tools.divide_df(df_all)
    x_train = train_df[top_features].values
    y_train = train_df["Survived"].values
    x_test = test_df[top_features].values

    rf_oof_train, rf_oof_test = get_out_fold(rf, x_train, y_train, x_test)       # Random Forest
    ada_oof_train, ada_oof_test = get_out_fold(ada, x_train, y_train, x_test)    # AdaBoost
    et_oof_train, et_oof_test = get_out_fold(et, x_train, y_train, x_test)       # Extra Trees
    gb_oof_train, gb_oof_test = get_out_fold(gb, x_train, y_train, x_test)       # Gradient Boost
    dt_oof_train, dt_oof_test = get_out_fold(dt, x_train, y_train, x_test)       # Decision Tree
    knn_oof_train, knn_oof_test = get_out_fold(knn, x_train, y_train, x_test)    # KNeighbors
    svm_oof_train, svm_oof_test = get_out_fold(svm, x_train, y_train, x_test)    # Support Vector
    print("Level 1 training done")

    # Level 2: use the level-1 predictions as features for XGBoost and write the submission file
    x_train = np.concatenate((rf_oof_train, ada_oof_train, et_oof_train, gb_oof_train, dt_oof_train, knn_oof_train, svm_oof_train), axis=1)
    x_test = np.concatenate((rf_oof_test, ada_oof_test, et_oof_test, gb_oof_test, dt_oof_test, knn_oof_test, svm_oof_test), axis=1)

    gbm = XGBClassifier(n_estimators=2000, max_depth=4, min_child_weight=2, gamma=0.9, subsample=0.8, colsample_bytree=0.8, objective='binary:logistic', nthread=-1, scale_pos_weight=1).fit(x_train, y_train)
    predictions = gbm.predict(x_test)
    StackingSubmission = pd.DataFrame({'PassengerId': test_df["PassengerId"], 'Survived': predictions})
    StackingSubmission.to_csv('StackingSubmission.csv', index=False, sep=',')

Submit the result and see.

Still not as good as the plain random forest model from the start. The key issue is clearly the feature engineering. This post is long enough, so I'll stop here. Now that work has resumed, I have much less time to study.

References
[1] https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial
[2] https://blog.csdn.net/Koala_Tree/article/details/78725881
[3] https://zhuanlan.zhihu.com/p/107958980
[4] https://blog.csdn.net/geduo_feng/article/details/79547554
[5] https://blog.csdn.net/kamendula/article/details/70318639
[6] https://blog.csdn.net/Koala_Tree/article/details/78725881
