Grid Search
丹丹

Loading the dataset

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in current scikit-learn
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

df = pd.read_csv(u"2019-08-01_金融数据描述_data1.csv",encoding = 'gbk')

# drop columns that are not useful for modelling
delete = ['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no','id_name','latest_query_time','source','loans_latest_time','first_transaction_time', 'student_feature']
df = df.drop(delete,axis=1)

# encode the categorical feature
df['reg_preference_for_trad'] = LabelEncoder().fit_transform(df['reg_preference_for_trad'].astype(str))

# fill missing values in every column with its mode
for i in range(df.shape[1]):
    feature = df.iloc[:, i].values.reshape(-1, 1)
    imp_mode = SimpleImputer(strategy='most_frequent')
    df.iloc[:, i] = imp_mode.fit_transform(feature)

# split into training and test sets
X = df[:].drop("status",axis=1)
y = df["status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = False)

# scale features to [0, 1]
X_train = minmax_scale(X_train)
X_test = minmax_scale(X_test)
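
Note that minmax_scale above computes the min and max of the training set and the test set independently. A common alternative is to fit a single MinMaxScaler on the training data and reuse it to transform the test data, so both share one scale. A minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# fit the scaling range on the training data only, then apply it to the test data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)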

Simple Grid Search (holdout validation)

The purpose of cross-validation is to make model evaluation more accurate and reliable.

Drawback of the simple approach:

  • The final result depends heavily on how the data happens to be split; cross-validation can be used to reduce this randomness (see the sketch below).
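
To illustrate the idea, K-fold cross-validation scores one parameter combination on several different train/validation splits and averages the results, so no single split dominates the estimate. A minimal sketch (the gamma and C values here are just examples):

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# evaluate one parameter combination on 5 different splits instead of a single holdout split
scores = cross_val_score(SVC(gamma=0.1, C=10), X_train, y_train, cv=5)
print("CV scores:", scores)
print("Mean CV score: {:.2f}".format(scores.mean()))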
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

print("Size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))

#### grid search start
best_score = 0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)  # train once for every possible parameter combination
        svm.fit(X_train,y_train)
        score = svm.score(X_test,y_test)
        if score > best_score:  # keep the best-performing parameters
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}
### grid search end

print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
Size of training set:3803 size of testing set:951
Best score:0.78
Best parameters:{'gamma': 0.1, 'C': 10}

Once the original data has been split into a training set and a test set, the test set ends up being used both to tune the parameters and to measure how good the model is. As a result, the final score looks better than the true performance and the model's generalization ability is overestimated.

The fix is to split the training set once more into a training part and a validation part, so the original data ends up in three pieces: a training set, a validation set and a test set. The training set is used to train the model, the validation set to tune the parameters, and the test set to measure how well the final model performs.

X_trainval,y_trainval = X_train,y_train
X_train,X_val,y_train,y_val = train_test_split(X_trainval,y_trainval)
print("Size of training set:{} size of validation set:{} size of testing set:{}".format(X_train.shape[0],X_val.shape[0],X_test.shape[0]))

best_score = 0.0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm = SVC(gamma=gamma,C=C)
        svm.fit(X_train,y_train)
        score = svm.score(X_val,y_val)  # tune against the validation set, not the test set
        if score > best_score:
            best_score = score
            best_parameters = {'gamma':gamma,'C':C}

# retrain on the combined training + validation data with the best parameters,
# then measure performance once on the untouched test set
svm = SVC(**best_parameters)
svm.fit(X_trainval,y_trainval)
test_score = svm.score(X_test,y_test)
print("Best score on validation set:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))
print("Best score on test set:{:.2f}".format(test_score))
Size of training set:2852 size of validation set:951 size of testing set:951
Best score on validation set:0.79
Best parameters:{'gamma': 0.01, 'C': 100}
Best score on test set:0.76

Grid Search with Cross Validation (K-fold cross-validation)

A shared drawback of grid search methods is that they are time-consuming: the more parameters and the more candidate values, the longer the search takes.
So in practice it is common to search a coarse range first and then refine around the best result.
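
For example, after the coarse 6×6 grid in the next block settles on gamma=0.1 and C=10, a follow-up search could use a finer grid centred on those values. A minimal sketch (the candidate values are illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# finer grid centred on the coarse optimum (gamma = 0.1, C = 10)
fine_grid = {"gamma": [0.03, 0.1, 0.3], "C": [3, 10, 30]}
fine_search = GridSearchCV(SVC(), fine_grid, cv=5)
fine_search.fit(X_train, y_train)
print("Refined best parameters:", fine_search.best_params_)
print("Refined CV score: {:.2f}".format(fine_search.best_score_))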

from sklearn.model_selection import GridSearchCV

param_grid = {"gamma":[0.001,0.01,0.1,1,10,100],
              "C":[0.001,0.01,0.1,1,10,100]}
print("Parameters:{}".format(param_grid))

grid_search = GridSearchCV(SVC(),param_grid,cv=5)
grid_search.fit(X_train,y_train)

print("Test set score:{:.2f}".format(grid_search.score(X_test,y_test)))
print("Best parameters:{}".format(grid_search.best_params_))
print("Best score on train set:{:.2f}".format(grid_search.best_score_))
Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}
Test set score:0.77
Best parameters:{'C': 10, 'gamma': 0.1}
Best score on train set:0.79
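
Beyond best_params_ and best_score_, the fitted GridSearchCV object also exposes the full per-combination results in cv_results_, which is useful for checking how sensitive the score is to each parameter. A minimal sketch:

import pandas as pd

# cv_results_ holds the mean cross-validated score for every gamma/C combination
results = pd.DataFrame(grid_search.cv_results_)
print(results[["param_gamma", "param_C", "mean_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())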
