数据理解与清洗
丹丹

数据集载入

1
2
3
4
5
6
7
8
9
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv(u"2019-08-01_金融数据描述_data1.csv",encoding = 'gbk')
1
df.head()
Unnamed: 0 custid trade_no bank_card_no low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day ... loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
0 5 2791858 20180507115231274000000023057383 卡号1 0.01 0.99 0 0.90 0.55 0.313 ... 2900.0 1688.0 1200.0 75.0 1.0 2.0 1200.0 1200.0 12.0 18.0
1 10 534047 20180507121002192000000023073000 卡号1 0.02 0.94 2000 1.28 1.00 0.458 ... 3500.0 1758.0 15100.0 80.0 5.0 6.0 22800.0 9360.0 4.0 2.0
2 12 2849787 20180507125159718000000023114911 卡号1 0.04 0.96 0 1.00 1.00 0.114 ... 1600.0 1250.0 4200.0 87.0 1.0 1.0 4200.0 4200.0 2.0 6.0
3 13 1809708 20180507121358683000000388283484 卡号1 0.00 0.96 2000 0.13 0.57 0.777 ... 3200.0 1541.0 16300.0 80.0 5.0 5.0 30000.0 12180.0 2.0 4.0
4 14 2499829 20180507115448545000000388205844 卡号1 0.01 0.99 0 0.46 1.00 0.175 ... 2300.0 1630.0 8300.0 79.0 2.0 2.0 8400.0 8250.0 22.0 120.0

5 rows × 90 columns

1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB
1
2
3
4
5
6
7
typedic = {} # 类型字典
for name in df.columns:
typedic[str(df[name].dtype)] = typedic.get(str(df[name].dtype),[])+[name]

for key,value in typedic.items():
print("{}格式共有{}个: {}".format(key,len(value),value))
print("")
int64格式共有13个: ['Unnamed: 0', 'custid', 'take_amount_in_later_12_month_highest', 'repayment_capability', 'is_high_user', 'historical_trans_amount', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'status']

object格式共有7个: ['trade_no', 'bank_card_no', 'reg_preference_for_trad', 'source', 'id_name', 'latest_query_time', 'loans_latest_time']

float64格式共有70个: ['low_volume_percent', 'middle_volume_percent', 'trans_amount_increase_rate_lately', 'trans_activity_month', 'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter', 'trans_days_interval', 'regional_mobility', 'student_feature', 'number_of_trans_from_2011', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'avg_consume_less_12_valid_month', 'top_trans_count_last_1_month', 'avg_price_top_last_12_valid_month', 'trans_top_time_last_1_month', 'trans_top_time_last_6_month', 'consume_top_time_last_1_month', 'consume_top_time_last_6_month', 'cross_consume_count_last_1_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'consume_mini_time_last_1_month', 'max_consume_count_later_6_month', 'railway_consume_count_last_12_month', 'jewelry_consume_count_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'apply_credibility', 'query_org_count', 'query_finance_count', 'query_cash_count', 'query_sum_count', 'latest_one_month_apply', 'latest_three_month_apply', 'latest_six_month_apply', 'loans_score', 'loans_credibility_behavior', 'loans_count', 'loans_settle_count', 'loans_overdue_count', 'loans_org_count_behavior', 'consfin_org_count_behavior', 'loans_cash_count', 'latest_one_month_loan', 'latest_three_month_loan', 'latest_six_month_loan', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_long_time', 'loans_credit_limit', 'loans_credibility_limit', 'loans_org_count_current', 'loans_product_count', 'loans_max_limit', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_credibility', 'consfin_org_count_current', 'consfin_product_count', 'consfin_max_limit', 'consfin_avg_limit', 'latest_query_day', 'loans_latest_day']
1
df.columns
Index(['Unnamed: 0', 'custid', 'trade_no', 'bank_card_no',
       'low_volume_percent', 'middle_volume_percent',
       'take_amount_in_later_12_month_highest',
       'trans_amount_increase_rate_lately', 'trans_activity_month',
       'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter',
       'trans_days_interval', 'regional_mobility', 'student_feature',
       'repayment_capability', 'is_high_user', 'number_of_trans_from_2011',
       'first_transaction_time', 'historical_trans_amount',
       'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month',
       'avg_consume_less_12_valid_month', 'abs',
       'top_trans_count_last_1_month', 'avg_price_last_12_month',
       'avg_price_top_last_12_valid_month', 'reg_preference_for_trad',
       'trans_top_time_last_1_month', 'trans_top_time_last_6_month',
       'consume_top_time_last_1_month', 'consume_top_time_last_6_month',
       'cross_consume_count_last_1_month',
       'trans_fail_top_count_enum_last_1_month',
       'trans_fail_top_count_enum_last_6_month',
       'trans_fail_top_count_enum_last_12_month',
       'consume_mini_time_last_1_month',
       'max_cumulative_consume_later_1_month',
       'max_consume_count_later_6_month',
       'railway_consume_count_last_12_month',
       'pawns_auctions_trusts_consume_last_1_month',
       'pawns_auctions_trusts_consume_last_6_month',
       'jewelry_consume_count_last_6_month', 'status', 'source',
       'first_transaction_day', 'trans_day_last_12_month', 'id_name',
       'apply_score', 'apply_credibility', 'query_org_count',
       'query_finance_count', 'query_cash_count', 'query_sum_count',
       'latest_query_time', 'latest_one_month_apply',
       'latest_three_month_apply', 'latest_six_month_apply', 'loans_score',
       'loans_credibility_behavior', 'loans_count', 'loans_settle_count',
       'loans_overdue_count', 'loans_org_count_behavior',
       'consfin_org_count_behavior', 'loans_cash_count',
       'latest_one_month_loan', 'latest_three_month_loan',
       'latest_six_month_loan', 'history_suc_fee', 'history_fail_fee',
       'latest_one_month_suc', 'latest_one_month_fail', 'loans_long_time',
       'loans_latest_time', 'loans_credit_limit', 'loans_credibility_limit',
       'loans_org_count_current', 'loans_product_count', 'loans_max_limit',
       'loans_avg_limit', 'consfin_credit_limit', 'consfin_credibility',
       'consfin_org_count_current', 'consfin_product_count',
       'consfin_max_limit', 'consfin_avg_limit', 'latest_query_day',
       'loans_latest_day'],
      dtype='object')

可视化各特征

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
name_list_1 = [
'low_volume_percent', 'middle_volume_percent',
'take_amount_in_later_12_month_highest',
'trans_amount_increase_rate_lately', 'trans_activity_month',
'trans_activity_day', 'transd_mcc', 'trans_days_interval_filter',
'trans_days_interval', 'regional_mobility', 'student_feature',
'repayment_capability', 'is_high_user', 'number_of_trans_from_2011',
'first_transaction_time'
]
name_list_2 = [
'historical_trans_amount',
'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month',
'avg_consume_less_12_valid_month', 'abs',
'top_trans_count_last_1_month', 'avg_price_last_12_month',
'avg_price_top_last_12_valid_month', 'reg_preference_for_trad',
'trans_top_time_last_1_month', 'trans_top_time_last_6_month',
'consume_top_time_last_1_month', 'consume_top_time_last_6_month',
'cross_consume_count_last_1_month',
]
name_list_3 = [
'trans_fail_top_count_enum_last_1_month',
'trans_fail_top_count_enum_last_6_month',
'trans_fail_top_count_enum_last_12_month',
'consume_mini_time_last_1_month',
'max_cumulative_consume_later_1_month',
'max_consume_count_later_6_month',
'railway_consume_count_last_12_month',
'pawns_auctions_trusts_consume_last_1_month',
'pawns_auctions_trusts_consume_last_6_month',
'jewelry_consume_count_last_6_month',
'first_transaction_day', 'trans_day_last_12_month', 'apply_score',
'apply_credibility', 'query_org_count',
]
name_list_4 = [
'query_finance_count', 'query_cash_count', 'query_sum_count',
'latest_query_time', 'latest_one_month_apply',
'latest_three_month_apply', 'latest_six_month_apply', 'loans_score',
'loans_credibility_behavior', 'loans_count', 'loans_settle_count',
'loans_overdue_count', 'loans_org_count_behavior',
'consfin_org_count_behavior', 'loans_cash_count',
]
name_list_5 = [
'latest_one_month_loan', 'latest_three_month_loan',
'latest_six_month_loan', 'history_suc_fee', 'history_fail_fee',
'latest_one_month_suc', 'latest_one_month_fail', 'loans_long_time',
'loans_latest_time', 'loans_credit_limit', 'loans_credibility_limit',
'loans_org_count_current', 'loans_product_count', 'loans_max_limit',
'loans_avg_limit',
]
name_list_6 = [
'consfin_credit_limit', 'consfin_credibility',
'consfin_org_count_current', 'consfin_product_count',
'consfin_max_limit', 'consfin_avg_limit', 'latest_query_day',
'loans_latest_day'
]
1
2
3
4
5
6
7
f, ax = plt.subplots(5, 3, figsize=(20, 20))

for i,name in enumerate(name_list_1):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_1.png")
plt.close()

1
2
3
4
5
6
7
f, ax = plt.subplots(5, 3, figsize=(20, 20))

for i,name in enumerate(name_list_2):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_2.png")
plt.close()

1
2
3
4
5
6
7
f, ax = plt.subplots(5, 3, figsize=(20, 20))

for i,name in enumerate(name_list_3):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_3.png")
plt.close()

1
2
3
4
5
6
7
f, ax = plt.subplots(5, 3, figsize=(20, 20))

for i,name in enumerate(name_list_4):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_4.png")
plt.close()

1
2
3
4
5
6
7
f, ax = plt.subplots(5, 3, figsize=(20, 20))

for i,name in enumerate(name_list_5):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_5.png")
plt.close()

1
2
3
4
5
6
7
f, ax = plt.subplots(3, 3, figsize=(20, 20))

for i,name in enumerate(name_list_6):
sns.countplot(data=df, x=name, hue='status', ax=ax[i // 3][i % 3])

plt.savefig("../img/2019-08-01_金融数据描述_6.png")
plt.close()

从以上分布图可以看出,连续型数据的特征基本上符合正态分布,指数分布或者泊松分布等,但在[‘rank_trad_1_month’,’trans_top_time_last_1_month’,’trans_top_time_last_6_month’,’consume_top_time_last_1_month’,’consume_top_time_last_6_month’,’loans_credit_limit’,’loans_credibility_limit’,’loans_max_limit’,’consfin_credibility’]这几个特征上出现数据于最左最右位置会出现骤升。从业务层面考虑,暂时未得出满意结论,待确认特征取数逻辑后再分析。
还有,’avg_price_last_12_month’,’consfin_credit_limit’这两个特征虽然是连续型特征,但由图上看,出现标签分段,考虑继续探索,查看是否存在分段。

另外几个离散型特征,如’reg_preference_for_trad’,’regional_mobility’等,在标签status上区别明显,感觉是对模型有用的特征。

1
2
3
4
sns.boxplot(y = 'consfin_credit_limit',x = 'status',data = df)

plt.savefig("../img/2019-08-01_金融数据描述_7.png")
plt.close()

1
2
3
4
sns.boxplot(y = 'avg_price_last_12_month',x = 'status',data = df)

plt.savefig("../img/2019-08-01_金融数据描述_8.png")
plt.close()

1
2
3
4
sns.boxplot(y = 'rank_trad_1_month',x = 'status',data = df)

plt.savefig("../img/2019-08-01_金融数据描述_9.png")
plt.close()

数据预处理

为什么需要处理缺失值?

不同缺失值的处理方式在一定程度上影响了特征提取、建模和模型训练缺失值太多,可以尝试着直接删除,如果不删除,处理不好,可能会引来噪声缺失值较少,少于某一缺失率时,直接删除又会带来信息的损失,此时可以采取适当的填充方式。

缺失率大于多少时应当抛弃该特征?

一般为70%,但是还要分析该特征与训练目标的重要程度

缺失值填充有哪些方法?

插值填充

  • 特殊值
  • 均值、加权均值(缺点:降低数据的方差,即随机性)
  • 中位数
  • 众数(多用在类别变数)

插补法

  • 随机插补法:从总体中随机抽取某个样本代替缺失样本
  • 多重插补法:通过变量之间的关系对缺失数据进行预测,利用蒙特卡洛方法生成多个完整的数据集,在对这些数据集进行分析,最后对分析结果进行汇总处理
  • 热平台插补:指在非缺失数据集中找到一个与缺失值所在样本相似的样本(匹配样本),利用其中的观测值对缺失值进行插补
  • 拉格朗日差值法
  • 牛顿插值法

列的维度上,如果是连续性,就使用平均值插补,如果是离散性,就使用众数来插补; 行的维度上,引入预测模型,可考虑辅助回归,通过变量间的关系来预测缺失数据

丢弃

如果行和列的缺失达到一定的比例,建议放弃整行或整列数据

模型法

更多时候我们会基于已有的其他字段,将缺失字段作为目标变量进行预测,从而得到最为可能的补全值。如果带有缺失值的列是数值变量,采用回归模型补全;如果是分类变量,则采用分类模型补全

专家补全

对于少量且具有重要意义的数据记录,专家补足也是非常重要的一种途径

不处理

  • 当缺失值对模型的影响不大时,直接在包含空值的数据上进行数据挖掘
  • 很多模型对于缺失值有容忍度或灵活的处理方法,因此在预处理阶段可以不做处理
  • 树模型对于缺失值的敏感度较低,大部分时候可以在数据有缺失时使用
  • 涉及到距离度量(distance measurement)时,如计算两个点之间的距离,缺失数据就变得比较重要。因为涉及到“距离”这个概念,那么缺失值处理不当就会导致效果很差,如K近邻算法(KNN)和支持向量机(SVM)
  • 线性模型的代价函数(loss function)往往涉及到距离(distance)的计算,计算预测值和真实值之间的差别,这容易导致对缺失值敏感
  • 神经网络的鲁棒性强,对于缺失数据不是非常敏感,但一般没有那么多数据可供使用。
  • 贝叶斯模型对于缺失数据也比较稳定,数据量很小的时候首推贝叶斯模型。

时间序列

依情况处理,主要看单独时间字段或时间字段与某些字段的组合属性对目标分析的作用程度,再采取相应方式来进行特征提取比如:可以将时间字段与其他字段属性进行组合,分析每天、每周、每月或特点星期几等情况下特征数据频率信息,总的来说还是得看分析得目标

通常看到的情况是,除了考虑时间序列这个单独的特征外,往往是将时间序列和具有时间属性的特征联合起来分析,查看组合特征的对所需要分析的内容的影响

异常值和离群值

大多数的参数统计数值如均值、标准差、相关系数等,以及基于这些参数的统计分析均对离群值高度敏感。因此离群值的存在会对数据分析造成极大影响

判断

  • 用分位数搞定,看跟平均值的偏差超过几倍标准差
  • LOF算法
  • describe的时候加一个箱型图

处理

  • 对数转换
  • 缩尾
  • 截尾
  • 插值

无用特征删除

将所有object格式的改为数值格式,其中有部分无用特征则可直接删除

由上可看出需处理包含:[‘trade_no’, ‘bank_card_no’, ‘reg_preference_for_trad’, ‘source’, ‘id_name’, ‘latest_query_time’, ‘loans_latest_time’]

1
df['source'].value_counts(dropna = False)
xs    4754
Name: source, dtype: int64
1
2
to_drop = ['trade_no', 'bank_card_no', 'student_feature','cross_consume_count_last_1_month', 'id_name', 'source']
df = df.drop(to_drop, axis=1)

缺失填充

1
df['reg_preference_for_trad'].value_counts(dropna = False)
一线城市    3403
三线城市    1064
境外       150
二线城市     131
其他城市       4
NaN        2
Name: reg_preference_for_trad, dtype: int64
1
2
df['reg_preference_for_trad'].fillna('其他城市',inplace = True)
df['reg_preference_for_trad'].value_counts(dropna = False)
一线城市    3403
三线城市    1064
境外       150
二线城市     131
其他城市       6
Name: reg_preference_for_trad, dtype: int64
1
df.isnull().sum().sort_values(ascending = False)
latest_three_month_apply                      304
query_finance_count                           304
latest_six_month_apply                        304
latest_query_time                             304
query_sum_count                               304
query_cash_count                              304
latest_one_month_apply                        304
query_org_count                               304
apply_credibility                             304
apply_score                                   304
latest_query_day                              304
latest_three_month_loan                       297
loans_score                                   297
loans_credibility_behavior                    297
loans_count                                   297
loans_settle_count                            297
loans_overdue_count                           297
loans_org_count_behavior                      297
consfin_org_count_behavior                    297
loans_cash_count                              297
latest_one_month_loan                         297
loans_latest_day                              297
latest_six_month_loan                         297
loans_product_count                           297
history_suc_fee                               297
consfin_avg_limit                             297
consfin_product_count                         297
consfin_org_count_current                     297
consfin_credibility                           297
consfin_credit_limit                          297
                                             ... 
trans_amount_increase_rate_lately               3
trans_days_interval                             2
regional_mobility                               2
number_of_trans_from_2011                       2
middle_volume_percent                           2
transd_mcc                                      2
trans_activity_day                              2
trans_activity_month                            2
low_volume_percent                              2
historical_trans_day                            2
first_transaction_time                          2
first_transaction_day                           2
rank_trad_1_month                               2
avg_consume_less_12_valid_month                 2
trans_day_last_12_month                         2
top_trans_count_last_1_month                    2
max_cumulative_consume_later_1_month            0
custid                                          0
status                                          0
take_amount_in_later_12_month_highest           0
pawns_auctions_trusts_consume_last_6_month      0
pawns_auctions_trusts_consume_last_1_month      0
reg_preference_for_trad                         0
historical_trans_amount                         0
avg_price_last_12_month                         0
abs                                             0
repayment_capability                            0
is_high_user                                    0
trans_amount_3_month                            0
Unnamed: 0                                      0
Length: 84, dtype: int64

特征间探索

1
2
3
4
5
corr = df[:].corr()
plt.figure(figsize=(15, 10))
sb.heatmap(corr)
plt.savefig("../img/2019-08-01_金融数据描述_10.png")
plt.close()

1
corr
Unnamed: 0 custid low_volume_percent middle_volume_percent take_amount_in_later_12_month_highest trans_amount_increase_rate_lately trans_activity_month trans_activity_day transd_mcc trans_days_interval_filter ... loans_max_limit loans_avg_limit consfin_credit_limit consfin_credibility consfin_org_count_current consfin_product_count consfin_max_limit consfin_avg_limit latest_query_day loans_latest_day
Unnamed: 0 1.000000 -0.007628 0.002861 -0.022844 -0.019728 0.014076 -0.021715 -0.001027 -0.006975 -0.021338 ... 0.012189 -0.004457 0.005967 0.007746 0.012062 0.009583 0.004020 0.008860 0.020774 0.026387
custid -0.007628 1.000000 0.031430 0.090599 -0.030282 -0.001299 -0.118811 -0.216062 -0.147485 0.144513 ... -0.236185 -0.152017 -0.093180 -0.041979 -0.129297 -0.139197 -0.115324 -0.076767 0.099208 0.098628
low_volume_percent 0.002861 0.031430 1.000000 -0.151504 -0.066047 0.005331 0.023895 -0.093754 -0.095780 0.054680 ... -0.140173 -0.128145 -0.083987 -0.032534 -0.074851 -0.067226 -0.096647 -0.094627 0.069301 0.047500
middle_volume_percent -0.022844 0.090599 -0.151504 1.000000 -0.135050 0.001608 -0.031743 -0.055491 -0.094736 0.010146 ... -0.145599 -0.110230 -0.153241 0.003266 -0.007761 -0.014847 -0.169575 -0.174126 -0.001628 -0.003253
take_amount_in_later_12_month_highest -0.019728 -0.030282 -0.066047 -0.135050 1.000000 -0.006699 0.060997 0.112582 0.181435 -0.055318 ... 0.071798 0.038880 0.039328 0.015841 0.021387 0.021337 0.054433 0.040634 -0.000570 -0.010231
trans_amount_increase_rate_lately 0.014076 -0.001299 0.005331 0.001608 -0.006699 1.000000 0.015389 -0.003061 -0.011644 0.013220 ... 0.007119 0.000119 0.028868 0.006174 -0.019901 -0.019603 0.007123 0.040865 -0.001457 0.027221
trans_activity_month -0.021715 -0.118811 0.023895 -0.031743 0.060997 0.015389 1.000000 0.063373 0.069272 -0.091630 ... 0.061847 0.037275 -0.036822 -0.000400 0.003332 0.005534 -0.022388 -0.043764 -0.024161 -0.014880
trans_activity_day -0.001027 -0.216062 -0.093754 -0.055491 0.112582 -0.003061 0.063373 1.000000 0.674547 -0.491304 ... 0.445155 0.202242 0.096622 0.181157 0.553486 0.551140 0.224509 0.085255 -0.188965 -0.299110
transd_mcc -0.006975 -0.147485 -0.095780 -0.094736 0.181435 -0.011644 0.069272 0.674547 1.000000 -0.407126 ... 0.313715 0.165959 0.066250 0.131905 0.371197 0.374660 0.156951 0.057663 -0.120330 -0.211593
trans_days_interval_filter -0.021338 0.144513 0.054680 0.010146 -0.055318 0.013220 -0.091630 -0.491304 -0.407126 1.000000 ... -0.222970 -0.118522 -0.055205 -0.142221 -0.287044 -0.286212 -0.111345 -0.045801 0.069560 0.163433
trans_days_interval -0.006297 0.200145 0.062421 -0.000383 -0.053329 0.000521 -0.151579 -0.542118 -0.447308 0.690819 ... -0.271875 -0.163049 -0.034716 -0.102588 -0.285654 -0.286510 -0.105000 -0.024740 0.102343 0.178628
regional_mobility 0.011694 -0.134926 -0.042939 0.009671 0.024089 -0.026960 -0.219911 0.143697 0.114936 -0.124350 ... 0.099696 0.062712 0.092796 0.054084 0.147120 0.145322 0.091150 0.079245 -0.051839 -0.107139
student_feature 0.014227 0.026639 -0.007327 0.018852 -0.013524 -0.002682 0.004966 0.011653 0.005921 -0.005412 ... 0.021982 0.007634 0.001860 0.009800 0.020368 0.017777 0.000096 0.004799 -0.020274 -0.031571
repayment_capability -0.006726 -0.026254 -0.041522 -0.178436 0.055174 -0.000572 0.024737 0.189494 0.115434 -0.048998 ... 0.128100 0.066266 0.052900 0.007068 0.046341 0.046471 0.063720 0.052120 -0.020964 -0.049951
is_high_user -0.002995 -0.004640 0.018546 -0.029172 0.068763 -0.002017 0.044231 -0.034766 -0.080892 0.146678 ... -0.006073 0.014809 0.001286 -0.000752 -0.020497 -0.017732 0.007764 0.013155 -0.011908 -0.015299
number_of_trans_from_2011 0.016355 -0.276856 -0.031668 -0.084995 0.056183 -0.013609 -0.198322 0.286353 0.262917 -0.181413 ... 0.195444 0.142227 0.105527 0.066191 0.212621 0.215217 0.138342 0.088776 -0.087938 -0.140635
historical_trans_amount 0.002346 -0.179652 -0.117333 -0.255846 0.102959 -0.002527 0.016522 0.526801 0.392433 -0.254563 ... 0.324514 0.165218 0.137554 0.073505 0.288324 0.288420 0.201670 0.138920 -0.087267 -0.156048
historical_trans_day 0.004245 -0.345875 -0.068345 -0.116128 0.112908 -0.006987 0.052108 0.855832 0.603820 -0.424616 ... 0.430792 0.223623 0.111262 0.142560 0.472101 0.473890 0.219427 0.097498 -0.171007 -0.267034
rank_trad_1_month 0.011515 0.116798 0.102996 0.139914 -0.089095 0.028153 -0.017110 -0.517923 -0.481565 0.345540 ... -0.235028 -0.121879 -0.120020 -0.133787 -0.328297 -0.327323 -0.180749 -0.115613 0.139572 0.243659
trans_amount_3_month -0.003329 -0.069252 -0.063526 -0.189438 0.074279 -0.003598 0.010864 0.293735 0.250140 -0.139654 ... 0.186909 0.084463 0.065007 0.033856 0.151395 0.153457 0.094556 0.063275 -0.044342 -0.099134
avg_consume_less_12_valid_month -0.001742 -0.087905 -0.042251 -0.154257 0.031360 0.006201 0.014532 0.081550 0.194729 -0.017535 ... 0.054239 0.010832 0.024610 0.026514 0.103333 0.101638 0.044864 0.011955 -0.069481 -0.090181
abs -0.004830 -0.059940 -0.059972 -0.247023 0.081978 -0.003474 0.038276 0.306292 0.160103 -0.054408 ... 0.152274 0.105389 0.072716 0.009934 0.071542 0.074592 0.102727 0.090651 -0.042772 -0.065586
top_trans_count_last_1_month 0.021195 0.101220 0.037765 0.059410 -0.063713 0.029701 -0.026923 -0.503568 -0.475823 0.381422 ... -0.199205 -0.082819 -0.071977 -0.138804 -0.323863 -0.324179 -0.128756 -0.058697 0.140451 0.247943
avg_price_last_12_month 0.011454 -0.090752 -0.286213 -0.556220 0.129908 -0.004808 -0.006858 0.068127 0.067431 -0.003254 ... 0.204154 0.186553 0.183277 -0.017057 0.019580 0.017570 0.179284 0.205885 -0.033386 -0.020140
avg_price_top_last_12_valid_month -0.018637 0.124158 0.509637 0.369498 -0.123202 -0.030556 0.011839 -0.138470 -0.145411 0.064464 ... -0.316923 -0.285338 -0.172475 -0.002851 -0.087684 -0.080339 -0.197348 -0.195194 0.079399 0.062422
trans_top_time_last_1_month -0.007772 -0.003624 -0.023020 -0.015316 0.028547 -0.021723 0.018931 0.092705 0.143620 -0.127328 ... 0.039372 0.019334 0.005990 0.060674 0.059770 0.056594 0.021747 0.000811 -0.030739 -0.069952
trans_top_time_last_6_month 0.012229 0.063470 0.064055 0.025356 -0.079647 0.032999 -0.016856 -0.519642 -0.400228 0.541678 ... -0.227717 -0.111353 -0.055343 -0.151634 -0.336984 -0.334724 -0.106172 -0.042912 0.088539 0.199473
consume_top_time_last_1_month -0.012425 -0.013828 -0.030325 -0.024816 0.022475 -0.020937 0.021517 0.126348 0.162267 -0.155562 ... 0.053634 0.023311 0.018442 0.073859 0.083782 0.080835 0.038965 0.012452 -0.044495 -0.078152
consume_top_time_last_6_month 0.010297 0.071615 0.026853 0.033989 -0.062038 0.032106 -0.009154 -0.506739 -0.371122 0.489491 ... -0.228478 -0.109085 -0.055605 -0.128403 -0.331808 -0.329305 -0.105383 -0.042412 0.083349 0.194636
cross_consume_count_last_1_month -0.019612 -0.015749 0.056716 -0.076003 0.042310 -0.004524 -0.001588 0.217779 0.236684 -0.098826 ... 0.101269 0.038369 0.032169 0.024935 0.129515 0.131031 0.041163 0.025667 -0.026850 -0.081532
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
loans_score -0.014716 -0.006276 -0.007019 -0.039519 -0.003928 0.010481 -0.020418 -0.211471 -0.141698 0.105373 ... -0.145502 -0.071344 0.076157 -0.043698 -0.242594 -0.249384 -0.004854 0.085289 0.035587 -0.019444
loans_credibility_behavior -0.000191 -0.023587 0.037414 -0.031636 -0.017790 -0.003788 0.002694 -0.059242 -0.028579 0.016791 ... -0.109592 -0.059538 0.113228 0.202241 0.132995 0.141105 0.112848 0.110999 0.014235 -0.034587
loans_count 0.008971 -0.104423 -0.103643 0.093372 0.034853 -0.011057 0.010844 0.713700 0.433028 -0.340780 ... 0.454809 0.167113 0.097746 0.189068 0.669975 0.658099 0.240708 0.067635 -0.194495 -0.388366
loans_settle_count 0.008509 -0.106395 -0.115451 0.084747 0.044691 -0.011398 0.006203 0.723008 0.440667 -0.346732 ... 0.466131 0.175771 0.109141 0.183622 0.633814 0.618343 0.239062 0.079620 -0.198341 -0.374859
loans_overdue_count -0.002123 -0.047870 0.003258 0.055341 -0.018406 -0.009483 0.026175 0.319103 0.193574 -0.145065 ... 0.189896 0.071337 0.001805 0.089767 0.389047 0.396295 0.109103 -0.020359 -0.069116 -0.166293
loans_org_count_behavior 0.002662 -0.107997 -0.113256 0.086219 0.020780 -0.012629 0.014655 0.698811 0.451913 -0.330187 ... 0.462677 0.133557 0.065915 0.231726 0.799828 0.784147 0.257087 0.038651 -0.216354 -0.402870
consfin_org_count_behavior 0.012062 -0.129297 -0.074851 -0.007761 0.021387 -0.019901 0.003332 0.553486 0.371197 -0.287044 ... 0.357549 0.137506 0.102057 0.297390 1.000000 0.988027 0.345697 0.074170 -0.185038 -0.399885
loans_cash_count -0.002987 -0.078110 -0.115566 0.123857 0.016962 -0.006482 0.018475 0.662281 0.420938 -0.298750 ... 0.443326 0.108990 0.034866 0.156549 0.555002 0.539896 0.164961 0.012515 -0.197164 -0.337008
latest_one_month_loan -0.001009 -0.082016 -0.034938 -0.017258 0.008845 -0.010482 0.010030 0.251331 0.152255 -0.109048 ... 0.216387 0.078718 0.057899 0.087776 0.328566 0.334659 0.136317 0.039233 -0.167327 -0.564125
latest_three_month_loan 0.009841 -0.102432 -0.041141 -0.007416 0.009611 -0.012746 0.024060 0.277217 0.180567 -0.132316 ... 0.235019 0.083134 0.055062 0.099953 0.410173 0.415173 0.158521 0.038337 -0.172112 -0.556912
latest_six_month_loan 0.013859 -0.084090 -0.080426 0.080167 0.034499 -0.007440 0.023376 0.618386 0.381795 -0.294128 ... 0.395462 0.110447 0.067585 0.169560 0.653119 0.646769 0.214958 0.040151 -0.201667 -0.457269
history_suc_fee 0.005751 -0.154554 -0.068576 0.038832 0.036804 -0.013452 0.008331 0.707752 0.432147 -0.361614 ... 0.439947 0.158091 0.129951 0.194162 0.682134 0.674063 0.258122 0.092018 -0.197019 -0.392292
history_fail_fee 0.005633 -0.039117 0.009093 0.055486 -0.019751 -0.006480 0.033715 0.269878 0.148111 -0.115664 ... 0.164983 0.047106 0.008807 0.098387 0.399674 0.411005 0.130966 -0.011496 -0.060527 -0.199490
latest_one_month_suc 0.000168 -0.078523 -0.025131 -0.022327 0.022699 -0.009579 0.009873 0.228821 0.146654 -0.111359 ... 0.194397 0.056937 0.023565 0.092642 0.349219 0.360903 0.119473 0.014118 -0.167496 -0.459412
latest_one_month_fail 0.013939 -0.041356 0.021126 -0.018844 -0.004671 -0.005436 0.017951 0.133230 0.080658 -0.053043 ... 0.104707 0.041028 -0.000725 0.053177 0.228943 0.244784 0.078403 -0.011238 -0.051815 -0.232811
loans_long_time 0.000992 -0.216323 -0.074834 0.040822 0.012660 -0.002678 0.030914 0.384281 0.301560 -0.317888 ... 0.368329 0.268476 0.144168 0.229963 0.345921 0.337877 0.198060 0.134422 -0.134214 -0.214274
loans_credit_limit -0.006194 -0.160582 -0.138446 -0.086706 0.050185 0.013118 0.036172 0.229054 0.177929 -0.146462 ... 0.745694 0.901755 0.085200 0.057358 0.151520 0.152471 0.123290 0.087213 -0.111170 -0.157736
loans_credibility_limit -0.033850 -0.000850 -0.051545 0.090824 -0.025575 0.003868 -0.005513 0.161363 0.154721 -0.083218 ... 0.328845 0.456704 -0.022646 -0.029875 0.115503 0.114720 0.021164 -0.036374 -0.127687 -0.138313
loans_org_count_current -0.002987 -0.078110 -0.115566 0.123857 0.016962 -0.006482 0.018475 0.662281 0.420938 -0.298750 ... 0.443326 0.108990 0.034866 0.156549 0.555002 0.539896 0.164961 0.012515 -0.197164 -0.337008
loans_product_count -0.001941 -0.072159 -0.114296 0.122741 0.018572 -0.007575 0.019401 0.656939 0.416980 -0.295409 ... 0.441316 0.108962 0.031936 0.155744 0.553607 0.539424 0.163650 0.009397 -0.199549 -0.348389
loans_max_limit 0.012189 -0.236185 -0.140173 -0.145599 0.071798 0.007119 0.061847 0.445155 0.313715 -0.222970 ... 1.000000 0.716175 0.108760 0.101659 0.357549 0.354231 0.208389 0.109425 -0.190074 -0.282204
loans_avg_limit -0.004457 -0.152017 -0.128145 -0.110230 0.038880 0.000119 0.037275 0.202242 0.165959 -0.118522 ... 0.716175 1.000000 0.091117 0.031452 0.137506 0.137271 0.130307 0.098641 -0.099086 -0.144835
consfin_credit_limit 0.005967 -0.093180 -0.083987 -0.153241 0.039328 0.028868 -0.036822 0.096622 0.066250 -0.055205 ... 0.108760 0.091117 1.000000 0.241837 0.102057 0.081338 0.788637 0.901470 -0.047756 -0.110230
consfin_credibility 0.007746 -0.041979 -0.032534 0.003266 0.015841 0.006174 -0.000400 0.181157 0.131905 -0.142221 ... 0.101659 0.031452 0.241837 1.000000 0.297390 0.288275 0.216232 0.275861 -0.085131 -0.230984
consfin_org_count_current 0.012062 -0.129297 -0.074851 -0.007761 0.021387 -0.019901 0.003332 0.553486 0.371197 -0.287044 ... 0.357549 0.137506 0.102057 0.297390 1.000000 0.988027 0.345697 0.074170 -0.185038 -0.399885
consfin_product_count 0.009583 -0.139197 -0.067226 -0.014847 0.021337 -0.019603 0.005534 0.551140 0.374660 -0.286212 ... 0.354231 0.137271 0.081338 0.288275 0.988027 1.000000 0.329687 0.054292 -0.183685 -0.402011
consfin_max_limit 0.004020 -0.115324 -0.096647 -0.169575 0.054433 0.007123 -0.022388 0.224509 0.156951 -0.111345 ... 0.208389 0.130307 0.788637 0.216232 0.345697 0.329687 1.000000 0.790580 -0.096890 -0.193945
consfin_avg_limit 0.008860 -0.076767 -0.094627 -0.174126 0.040634 0.040865 -0.043764 0.085255 0.057663 -0.045801 ... 0.109425 0.098641 0.901470 0.275861 0.074170 0.054292 0.790580 1.000000 -0.042131 -0.088072
latest_query_day 0.020774 0.099208 0.069301 -0.001628 -0.000570 -0.001457 -0.024161 -0.188965 -0.120330 0.069560 ... -0.190074 -0.099086 -0.047756 -0.085131 -0.185038 -0.183685 -0.096890 -0.042131 1.000000 0.230078
loans_latest_day 0.026387 0.098628 0.047500 -0.003253 -0.010231 0.027221 -0.014880 -0.299110 -0.211593 0.163433 ... -0.282204 -0.144835 -0.110230 -0.230984 -0.399885 -0.402011 -0.193945 -0.088072 0.230078 1.000000

82 rows × 82 columns

参考:

  1. DataWhale数据挖掘实战营
  2. 利用Python进行数据分析·第2版
  3. Stata:离群值!离群值?离群值!
  4. 机器学习中如何处理不平衡数据?