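1. Case Background

As internet applications have become widespread, online lending has become a common form of borrowing. The online lending industry, however, also suffers from poorly controlled risk. To standardize the lending process and strengthen risk control, practitioners have turned to technical methods for mitigating that risk.
Many financial service providers have accumulated large volumes of customer data over years of operation, such as basic personal information and records of the services each customer has used. This article applies data mining to that data: the customer attributes serve as features, and whether the customer defaulted on a past loan serves as the class label. A model is trained and tuned on these records, then used to classify new applicants, and its prediction becomes an important input to the loan approval decision.
The workflow has three steps:
Step 1: clean the raw business data to obtain a training dataset that meets the requirements;
Step 2: train a model with a suitable classification algorithm (the algorithm's parameters can also be tuned);
Step 3: feed a new applicant's information into the model to obtain a prediction.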
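The three steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, standing in for cleaned loan records), and logistic regression is just one possible classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1 (stand-in): synthetic, already-clean data instead of real loan records.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.1, random_state=0)

# Step 2: train a classifier on the historical records.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 3: predict for "new applicants" the model has never seen.
predictions = model.predict(X_new)
print(predictions[:5])
```

The rest of the article follows the same shape, with most of the effort going into step 1.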

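2. Data Processing

The data for this case comes from an online lending company; only part of it is used here. Original data download: https://www.kaggle.com/husainsb/lendingclub-issued-loans

# Read the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dforigin = pd.read_csv('LendingClub10k.csv', sep=',')

dforigin.info()
print(dforigin.shape)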


2.1 Selecting the Data

Inspect the class-label field loan_status:

# Count the records for each loan_status value
dforigin['loan_status'].value_counts()

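Besides fully paid and charged-off (defaulted) loans, the field contains other statuses. This article analyzes only these two, so we keep just the records carrying one of them:

# Keep only the records whose loan_status is 'Fully Paid' or 'Charged Off'
df = dforigin.loc[dforigin['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy(deep=True)
print(df['loan_status'].value_counts())
print(df.shape)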

Check the default ratio:

# Show the proportion of fully paid and charged-off samples in df
print(df['loan_status'].value_counts() / df.shape[0])

Then convert the label to a number:

# Fully Paid becomes 0.0; Charged Off becomes 1.0
# (np.float was removed in NumPy 1.24, so the built-in float is used)
df['loan_status'] = df['loan_status'].apply(lambda s: float(s == 'Charged Off'))
print(df['loan_status'].value_counts())
df.rename(columns={'loan_status':'charged_off'}, inplace=True)
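The filtering, ratio and encoding steps of this section can also be written with vectorized pandas calls. The following is a minimal sketch on an invented toy table (the rows are not from the LendingClub file), not a replacement for the code above.

```python
import pandas as pd

# Toy stand-in for the loan table; the values are invented for illustration.
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                                    'Fully Paid', 'Fully Paid', 'Charged Off']})

# Keep the two statuses of interest.
kept = toy[toy['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()

# Class proportions in one call.
ratios = kept['loan_status'].value_counts(normalize=True)

# Vectorized encoding: the comparison gives booleans, astype(float) gives 0.0 / 1.0.
kept['charged_off'] = (kept['loan_status'] == 'Charged Off').astype(float)
print(ratios)
print(kept['charged_off'].tolist())
```

On large tables the vectorized comparison is usually faster than apply with a lambda, and it reads closer to the intent.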


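2.2 Removing Uninformative Data

A column whose values are all identical carries no information for data mining, so it can be dropped:

# Drop columns that contain only a single distinct value
drop_list = []
for col in df.columns:
    if df[col].nunique() == 1:
        drop_list.append(col)

print(drop_list)
print(df.shape)

df.drop(labels=drop_list, axis=1, inplace=True)
print(df.shape)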

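Other columns are almost entirely missing, which would seriously distort the analysis, so they are dropped as well:

# Drop columns that are almost entirely missing (fewer than 2% non-null values)
drop_list = []
for col in df.columns:
    if df[col].notnull().sum() / df.shape[0] < 0.02:
        drop_list.append(col)

print(drop_list)
df.drop(labels=drop_list, axis=1, inplace=True)
print(df.shape)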
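The two filters above can be folded into one reusable function. This is a sketch only: the function name drop_uninformative and the toy columns are made up for illustration, and the 2% threshold simply mirrors the value used above.

```python
import numpy as np
import pandas as pd

def drop_uninformative(frame, min_filled=0.02):
    """Return a copy without constant columns and without columns whose
    share of non-null values is below min_filled."""
    constant = [c for c in frame.columns if frame[c].nunique() == 1]
    filled = frame.notnull().mean()          # fraction of non-null values per column
    sparse = [c for c in frame.columns if filled[c] < min_filled]
    return frame.drop(columns=sorted(set(constant) | set(sparse)))

toy = pd.DataFrame({
    'loan_amnt': np.arange(100, dtype=float),   # varies: kept
    'policy_code': 1,                           # constant: dropped
    'mostly_empty': [5.0] + [np.nan] * 99,      # 1% filled: dropped
})
cleaned = drop_uninformative(toy)
print(cleaned.columns.tolist())
```

Returning a new frame instead of using inplace=True makes it easier to compare shapes before and after.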

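Finally, based on what each field means and on the author's own judgment, columns that should not be used to predict the class label are removed:

# Drop identifiers, free text and dates judged irrelevant to the target
df.drop(labels=['id', 'emp_title', 'title', 'last_credit_pull_d',
                'earliest_cr_line'], axis=1, inplace=True)

# Drop repayment and recovery fields, which are only recorded after a loan is issued
df.drop(labels=['collection_recovery_fee', 'debt_settlement_flag',
                'last_pymnt_amnt', 'last_pymnt_d', 'recoveries',
                'total_pymnt', 'total_pymnt_inv', 'total_rec_int',
                'total_rec_late_fee', 'total_rec_prncp'], axis=1, inplace=True)
df.shape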

2.3 Analyzing Attribute Relationships

# Examine the relationship between loan purpose and default
plt.figure(figsize=(12,8))
sns.countplot(y='purpose', hue='charged_off', data=df,
              orient='h',palette = 'BuPu')
plt.yticks(size=18)
plt.ylabel('Loan purpose',fontdict={'size':18})
plt.xticks(size=18)
plt.xlabel('Number of loans',fontdict={'size':18})
# plt.savefig("ch18_lc01.jpg",dpi=300,bbox_inches="tight")
plt.show()

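The chart shows that most online loans are taken out for debt consolidation and credit card repayment, and these two purposes also account for the largest number of defaults. One plausible reading is that borrowers who already carry debt are turned away by banks and other traditional lenders and can only repay it through online lending.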
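A count plot shows how many loans default in each category, not how likely a loan in that category is to default. The rate itself is easy to compute, since the mean of a 0/1 label is the default rate. A minimal sketch on invented toy rows (the purposes and outcomes are not from the real data):

```python
import pandas as pd

# Toy stand-in; purposes and outcomes are invented for illustration.
toy = pd.DataFrame({
    'purpose': ['debt_consolidation'] * 6 + ['credit_card'] * 3 + ['car'] * 1,
    'charged_off': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 label within each group equals that group's default rate.
summary = (toy.groupby('purpose')['charged_off']
              .agg(loans='count', default_rate='mean')
              .sort_values('default_rate', ascending=False))
print(summary)
```

Keeping the loan count next to the rate helps avoid over-reading categories with very few loans.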

# Examine the relationship between credit sub-grade and default
plt.figure(figsize=(12,8))
sns.countplot(y='sub_grade', hue='charged_off', data=df,
              order=sorted(df['sub_grade'].value_counts().index),
              orient='h',palette = 'BuPu')
plt.ylabel('Customer sub-grade',fontdict={'size':18})
plt.xticks(size=18)
plt.xlabel('Number of loans',fontdict={'size':18})
# plt.savefig("ch18_lc02.jpg",dpi=300,bbox_inches="tight")
plt.show()

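The chart shows that the lower a customer's grade, the larger the share of defaults; in some sub-grades every borrower in the sample defaulted. Lower grades are clearly more prone to default.

# Examine the relationship between loan term and default
plt.figure(figsize=(4,4))
sns.countplot(x='term', hue='charged_off', data=df)
# plt.savefig("ch18_lc03.jpg",dpi=300,bbox_inches="tight")
plt.show()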

Next is the loan term. The chart shows that 60-month loans default at a higher rate, which suggests that long-term loans default more often than short-term ones.

# Examine the relationship between FICO score and default
plt.figure(figsize=(12,6))
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==0],
            gridsize=500, label='charged_off = 0',linewidth=2,linestyle='--')
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==1],
            gridsize=500, label='charged_off = 1',linewidth=2,linestyle='-')
plt.xlabel('last_fico_range_high score',fontdict={'size':18})
plt.ylabel('Probability density of FICO score',fontdict={'size':18})
plt.legend()  # show the labels of the two curves
# plt.savefig("ch18_lc04.jpg",dpi=300,bbox_inches="tight")
plt.show()

Last is the FICO score. The chart clearly shows that the FICO scores of defaulting customers are, on average, far lower than those of customers who repaid.

# Correlation of every numeric attribute with the label
# (text columns are excluded first; recent pandas versions raise an error otherwise)
corr_charged_off = df.select_dtypes(include='number').corr()['charged_off']
corr_charged_off.drop(labels='charged_off', inplace=True)
corr_charged_off = corr_charged_off.sort_values()
plt.figure(figsize=(8,28))
sns.barplot(y=corr_charged_off.index, x=corr_charged_off.values, orient='h')
plt.title("Correlation with 'charged_off'")
plt.xlabel("Correlation coefficient with 'charged_off'")
xmax = np.abs(corr_charged_off).max()
plt.xlim([-xmax, xmax])
# plt.savefig("ch18_lc05.jpg",dpi=300,bbox_inches="tight")
plt.show()

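Finally, we can look at how each attribute correlates with the class label. FICO, for example, has a strong negative correlation with default: the higher the FICO score, the less likely a default.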
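To make the idea of "correlation with the label" concrete, here is a small sketch that computes the Pearson correlation of each numeric column with a 0/1 label using corrwith. The six rows are invented so that low scores go with default; they are not taken from the real data.

```python
import pandas as pd

# Invented toy rows: low FICO scores coincide with default (label 1.0).
toy = pd.DataFrame({
    'fico': [580, 600, 640, 700, 720, 760],
    'loan_amnt': [10, 7, 12, 9, 11, 8],
    'grade': ['D', 'D', 'C', 'B', 'A', 'A'],   # text column, excluded below
    'charged_off': [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

numeric = toy.select_dtypes(include='number')
corr = (numeric.drop(columns='charged_off')
               .corrwith(numeric['charged_off'])
               .sort_values())
print(corr)
```

A correlation coefficient only captures linear association, so a weak value does not by itself mean the attribute is useless to a nonlinear model such as a random forest.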

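2.4 Further Data Processing

The data still contains non-numeric columns, which must be converted to numbers before mining:

# Collect the non-numeric (object-dtype) columns
text_cols = []
for col in df.columns:
    if df[col].dtype == object:   # np.object was removed in NumPy 1.24
        text_cols.append(col)
print(text_cols)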

Then convert the columns one by one:

# Convert the term column, e.g. ' 36 months' -> 36.0
# (the raw values start with an extra space, so the digits sit at positions 1-2)
df['term'] = df['term'].apply(lambda s: float(s[1:3]))
print(df['term'].value_counts())


# Convert the disbursement_method column
DisbMethod_dict = {'Cash':0.0, 'DirectPay':1.0}
def DisbMethod_dict_to_float(s):
    return DisbMethod_dict[s]
df['disbursement_method'] = df['disbursement_method'].apply(DisbMethod_dict_to_float)
print(df['disbursement_method'].value_counts())

Finally, the remaining columns judged unimportant are dropped here, leaving data of shape (4952, 89):

# desc holds free-text comments that cannot be converted to numbers
df.drop(labels=['desc'], axis=1, inplace=True)
df.drop(labels=['issue_d'], axis=1, inplace=True)
df.shape
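Before modeling it is worth confirming that no text columns remain, because mean imputation and standardization only work on numeric data. If some are left in your copy of the data (an assumption; the text_cols output is not reproduced here), one common option is one-hot encoding. A minimal sketch with pd.get_dummies on invented toy rows; the home_ownership column is only an example:

```python
import pandas as pd

toy = pd.DataFrame({
    'loan_amnt': [1000.0, 2500.0, 4000.0],
    'home_ownership': ['RENT', 'OWN', 'RENT'],   # still text
    'charged_off': [0.0, 1.0, 0.0],
})

# Columns that are not numeric yet.
remaining = toy.select_dtypes(exclude='number').columns.tolist()
print(remaining)

# One indicator column per category; dtype=float keeps everything numeric.
encoded = pd.get_dummies(toy, columns=remaining, dtype=float)
print(encoded.columns.tolist())
```

One-hot encoding suits unordered categories; for ordered ones such as credit grades, mapping to integers may be a better fit.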

3. Model Training

With the data cleaned, we can move on to training the models.

3.1 Splitting Off the Training Set

First, separate the class label from the attributes:

X = df.drop(labels=['charged_off'], axis=1) # Features
y = df['charged_off'] # Target variable

Then split the data into training and test sets:

from sklearn.model_selection import train_test_split
random_state = 12 # Arbitrary value, fixed so that the results are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
# Fraction of non-null values in each training column, sparsest first
pd.DataFrame((X_train.notnull().sum() / X_train.shape[0]).sort_values(), columns=['Fraction not null'])

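Finally, fill in the missing values and standardize the features. Both transformers are fitted on the training set only and then applied to the training and test features alike:

from sklearn.impute import SimpleImputer
# Replace missing values (NaN) with the column mean of the training set
imputer = SimpleImputer(missing_values=np.nan, strategy="mean").fit(X_train)

X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(imputer.transform(X_test),  columns=X_test.columns)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)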

The resulting shapes are:

print("df.shape:",df.shape)
print("X_train:",X_train.shape)
print("y_train:",y_train.shape)
print("X_test:",X_test.shape)
print("y_test:",y_test.shape)
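The imputation and scaling in this section can also be chained in a scikit-learn Pipeline, which makes the "fit on training data only" rule hard to get wrong. A minimal sketch on invented toy arrays (the numbers are illustrative, not from the loan data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented toy features with one missing value in each set.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

prep = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),   # fill NaN with the training mean
    ('scale', StandardScaler()),                  # zero mean, unit variance
])
X_train_p = prep.fit_transform(X_train)   # statistics are learned here only
X_test_p = prep.transform(X_test)         # reused, never refitted, on the test set
print(X_test_p)
```

Fitting the transformers on the test set as well would leak information about it into training, which is why transform, not fit_transform, is used there.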


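3.2 Training the Models

Next we train the data mining models. Note that, because of random sampling and the numerical behavior of the algorithms, results may differ slightly from run to run.
The first model is logistic regression:

# Logistic regression model
import datetime
from sklearn.linear_model import LogisticRegression

lrmodel=LogisticRegression(solver='liblinear')   # initialize

start=datetime.datetime.now()
lrmodel.fit(X_train,y_train)   # fit the model parameters
end=datetime.datetime.now()

y_lrpred=lrmodel.predict(X_test)

my_eval(y_test, y_lrpred)   # evaluation helper; its definition is not shown in this article
print('Runtime =',end-start)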

Accuracy reaches 91% with AUC = 0.85, and the runtime is short.
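The my_eval helper called above is not defined anywhere in the article. Since the text reports accuracy and AUC, a hypothetical stand-in might look like the sketch below; the exact metrics and the return value are assumptions, not the author's implementation.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def my_eval(y_true, y_pred):
    """Hypothetical stand-in for the article's helper: print and return
    accuracy, AUC and the confusion matrix for hard 0/1 predictions."""
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_pred)   # AUC computed from labels, not scores
    cm = confusion_matrix(y_true, y_pred)
    print('Accuracy =', acc)
    print('AUC =', auc)
    print('Confusion matrix:\n', cm)
    return acc, auc, cm

# Tiny invented example: 5 true labels against 5 predicted labels.
acc, auc, cm = my_eval([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
```

Note that an AUC computed from hard predictions is coarser than one computed from predicted probabilities or decision scores, so the two are not directly comparable.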
Next, the random forest model:

# Random forest model
from sklearn.ensemble import RandomForestClassifier

rfmodel=RandomForestClassifier()

start=datetime.datetime.now()
rfmodel.fit(X_train, y_train)
end=datetime.datetime.now()

y_rfpred=rfmodel.predict(X_test)

my_eval(y_test, y_rfpred)
print('Runtime =',end-start)

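Accuracy is again high and the AUC is higher than for logistic regression, but training takes somewhat longer.
Last is SGDClassifier, which fits a family of linear models, such as a linear SVM or logistic regression, by stochastic gradient descent. See the scikit-learn documentation for the meaning of its parameters:

# SGDClassifier model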
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.metrics import roc_auc_score

param_grid = [{'loss': ['hinge'],
               'alpha': [10.0**k for k in range(-3,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'random_state': [random_state],
               'class_weight': [None, 'balanced'],
               'warm_start': [True]},
              {'loss': ['log_loss'],   # named 'log' before scikit-learn 1.1
               'penalty': ['l2', 'l1'],
               'alpha': [10.0**k for k in range(-3,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'random_state': [random_state],
               'warm_start': [True]}]
grid = GridSearchCV(estimator=SGDClassifier(), param_grid=param_grid, scoring=make_scorer(matthews_corrcoef), 
                    n_jobs=1, pre_dispatch=1, verbose=1, return_train_score=True)

start=datetime.datetime.now()
grid.fit(X_train, y_train)
end=datetime.datetime.now()

y_SGDCpred = grid.predict(X_test)

my_eval(y_test,y_SGDCpred)  
print('Runtime =',end-start)

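In the end, we can see that the AUC improves considerably, although training takes longer, since the grid search cross-validates each of the 28 parameter combinations.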
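The grid search above is scored with matthews_corrcoef rather than accuracy. To show what that score measures, the sketch below computes the Matthews correlation coefficient by hand from a confusion matrix and checks it against scikit-learn. The ten labels are invented and deliberately imbalanced.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Invented, imbalanced example: 8 repaid loans (0) and 2 defaults (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_sklearn = matthews_corrcoef(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, mcc_manual, mcc_sklearn)
```

Accuracy alone can flatter a model on imbalanced data: always predicting "no default" would also be right 8 times out of 10 here, while its MCC would be 0. That is a common reason to prefer MCC as a tuning score when defaults are the minority class.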

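4. Model Prediction

Finally, we take one record from the test set (row 10 in the code below) and run it through the three trained models. In this run all three predictions are correct.

# Predict with the three trained models
# Take one record from X_test as a simulated new input
# (a one-row DataFrame keeps the feature names the models were fitted with)
new_input=X_test.iloc[[10]]
print("new_input=",new_input)
new_output=y_test.iloc[10]
print("new_output=",new_output)
# Predict with the logistic regression model
print("Logistic regression prediction=",lrmodel.predict(new_input))
# Predict with the random forest model
print("Random forest prediction=",rfmodel.predict(new_input))
# Predict with the SGDClassifier model
print("SGDClassifier prediction=",grid.predict(new_input))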
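When the prediction feeds a lending decision, an estimated probability of default is often more useful than a hard 0/1 label. The sketch below shows predict_proba on a logistic regression model; the data is synthetic (make_classification), so the numbers say nothing about real loans.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the loan records.
X, y = make_classification(n_samples=300, n_features=6, random_state=12)
model = LogisticRegression(max_iter=1000).fit(X[:250], y[:250])

new_input = X[250:251]                       # one "applicant", shape (1, 6)
proba = model.predict_proba(new_input)[0]    # [P(class 0), P(class 1)]
label = model.predict(new_input)[0]
print('P(default) =', proba[1], 'label =', label)
```

Not every model exposes probabilities: an SGDClassifier fitted with the hinge loss, for instance, has no predict_proba, so this approach applies to the logistic regression and random forest models but not necessarily to the tuned SGD model.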



[Comprehensive Case Study] Online Loan Default Prediction
