作者:张杰

数据挖掘的步骤

  • Data Analysis
  • Feature Engineering
  • Feature Selection
  • Model Building
  • Model Deployment

1. Data Analysis

对于数据分析部分,需要探索以下几个点:
1)Missing Values

2)All The Numerical Variables

3)Distribution of the Numerical Variables

4)Categorical Variables

5)Cardinality of Categorical Variables

6)Outliers

Relationship between independent and dependent feature(SalePrice)

2. Feature Engineering

数据和特征决定了机器学习的上限,而模型和算法只是逼近这个上限而已。那特征工程到底是什么呢?顾名思义,其本质是一项工程活动,目的是最大限度地从原始数据中提取特征以供算法和模型使用。

特征工程中可能有以下问题:

  • 不属于同一量纲:即特征的规格不一样,不能够放在一起比较;
  • 定性特征不能直接使用:某些机器学习算法和模型只能接受定量特征的输入,那么需要将定性特征转换为定量特征。最简单的方式是为每一种定性值指定一个定量值,但是这种方式过于灵活,增加了调参的工作。通常使用哑编码的方式将定性特征转换为定量特征:假设有N种特征,当原始特征值为第i种定性值时,第i个扩展特征为1,其他扩展特征赋值为0。哑编码的方式相比直接指定的方式,不用增加调参的工作,对于线性模型来说,使用哑编码的特征可达到非线性的效果;
  • 存在缺失值:缺失值需要补充;
  • 信息利用率低:不同的机器学习算法和模型对数据中信息的利用是不同的,之前提到在线性模型中,使用对定性特征哑编码可以达到非线性的效果。类似地,对定量变量多项式化,或者进行其他的转换,都能达到非线性的效果。

特别地,需要重点关注缺失值的处理,异常值的处理,数据归一化,数据编码等关键问题。

3. Feature Selection

特征选择意味着选择那些在我们的模型上可以提高性能的特征变量。可以使用了一些机器学习和统计方法来选择最有相关的特征来改进模型性能。

4. Model Building

模型部分通常可以选择机器学习模型,也可以选择使用深度学习模型,特别地,很多时候模型集成往往有着出人意料的效果。

赛题分析

赛题数据

赛题以预测二手车的交易价格为任务,该数据来自某交易平台的二手车交易记录,总数据量超过40w,包含31列变量信息,其中15列为匿名变量。为了保证比赛的公平性,将会从中抽取15万条作为训练集,5万条作为测试集,同时会对name、model、brand和regionCode等信息进行脱敏。
数据链接:[https://tianchi.aliyun.com/competition/entrance/231784/introduction]

评测标准

评价标准为MAE(Mean Absolute Error)。

导入基本模块

# 基础工具
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
from tqdm import tqdm
import itertools

warnings.filterwarnings('ignore')
%matplotlib inline

## 模型预测的
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor

## 数据降维处理的
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA

## 参数搜索和评价的
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

import scipy.signal as signal

数据分析与特征工程

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype(&#039;category&#039;)

    end_mem = df.memory_usage().sum() 
    print(&#039;Memory usage after optimization is: {:.2f} MB&#039;.format(end_mem))
    print(&#039;Decreased by {:.1f}%&#039;.format(100 * (start_mem - end_mem) / start_mem))
    return df
Train_data = reduce_mem_usage(pd.read_csv(&#039;used_car_train_20200313.csv&#039;, sep=&#039; &#039;))
Test_data = reduce_mem_usage(pd.read_csv(&#039;used_car_testB_20200421.csv&#039;, sep=&#039; &#039;))
## 输出数据的大小信息
print(&#039;Train data shape:&#039;,Train_data.shape)
print(&#039;TestA data shape:&#039;,Test_data.shape)

数据挖掘入门——以二手车价格预测为例-Mo 动态

#合并数据集
concat_data = pd.concat([Train_data,Test_data])
concat_data.isnull().sum()

数据挖掘入门——以二手车价格预测为例-Mo 动态
这里我们发现bodyType,fuelType和gearbox缺失比较多,model缺失一行,price由于是输出,这里不用额外处理。

分析V系列匿名特征和非V系列特征

对于匿名变量来说,只有数值信息,需要更多关注到,这里把变量分为匿名变量和非匿名变量,单独进行分析

concat_data.columns

数据挖掘入门——以二手车价格预测为例-Mo 动态

首先提取非匿名变量进行分析,随机抽样10行数据.

concat_data[[&#039;bodyType&#039;, &#039;brand&#039;, &#039;creatDate&#039;, &#039;fuelType&#039;, &#039;gearbox&#039;,
       &#039;kilometer&#039;, &#039;model&#039;, &#039;name&#039;, &#039;notRepairedDamage&#039;, &#039;offerType&#039;, &#039;power&#039;,
    &#039;regDate&#039;, &#039;regionCode&#039;, &#039;seller&#039;]].sample(10)

数据挖掘入门——以二手车价格预测为例-Mo 动态

concat_data[[&#039;bodyType&#039;, &#039;brand&#039;, &#039;creatDate&#039;, &#039;fuelType&#039;, &#039;gearbox&#039;,
       &#039;kilometer&#039;, &#039;model&#039;, &#039;name&#039;, &#039;notRepairedDamage&#039;, &#039;offerType&#039;, &#039;power&#039;,
    &#039;regDate&#039;, &#039;regionCode&#039;, &#039;seller&#039;]].describe()

数据挖掘入门——以二手车价格预测为例-Mo 动态
这里发现了列名为notRepairedDamage的取值中包含了"-"的异常值,这里使用众数进行替换

concat_data[&#039;notRepairedDamage&#039;].value_counts()

数据挖掘入门——以二手车价格预测为例-Mo 动态

concat_data[&#039;notRepairedDamage&#039;] = concat_data[&#039;notRepairedDamage&#039;].replace(&#039;-&#039;,0).astype(&#039;float16&#039;)

接着继续分析匿名变量

concat_data[[&#039;v_0&#039;, &#039;v_1&#039;, &#039;v_2&#039;, &#039;v_3&#039;,
       &#039;v_4&#039;, &#039;v_5&#039;, &#039;v_6&#039;, &#039;v_7&#039;, &#039;v_8&#039;, &#039;v_9&#039;, &#039;v_10&#039;, &#039;v_11&#039;, &#039;v_12&#039;,
       &#039;v_13&#039;, &#039;v_14&#039;]].sample(10)

数据挖掘入门——以二手车价格预测为例-Mo 动态

concat_data[[&#039;v_0&#039;, &#039;v_1&#039;, &#039;v_2&#039;, &#039;v_3&#039;,
       &#039;v_4&#039;, &#039;v_5&#039;, &#039;v_6&#039;, &#039;v_7&#039;, &#039;v_8&#039;, &#039;v_9&#039;, &#039;v_10&#039;, &#039;v_11&#039;, &#039;v_12&#039;,
       &#039;v_13&#039;, &#039;v_14&#039;]].describe()

数据挖掘入门——以二手车价格预测为例-Mo 动态

对于缺失值,先简单的使用众数进行填充。
填充完后,数据中不再含有缺失值。

concat_data = concat_data.fillna(concat_data.mode().iloc[0,:])
print(&#039;concat_data shape:&#039;,concat_data.shape)
concat_data.isnull().sum()

数据挖掘入门——以二手车价格预测为例-Mo 动态

对离散型数值进行独热编码

对于每一个特征,如果它有m个可能值,那么经过独热编码后,就变成了m个二元特征(如成绩这个特征有好,中,差变成one-hot就是100, 010, 001)。并且,这些特征互斥,每次只有一个激活。因此,数据会变成稀疏的。
这样做的好处主要有:

  1. 解决了分类器不好处理属性数据的问题

  2. 在一定程度上也起到了扩充特征的作用

参考链接[https://www.cnblogs.com/zongfa/p/9305657.html]

可以通过df.value_counts().plot.bar来绘制数值分布情况

def plot_discrete_bar(data) :
    cnt = data.value_counts()
    p1 = plt.bar(cnt.index, height=list(cnt) , width=0.8)
    for x,y in zip(cnt.index,list(cnt)):
        plt.text(x+0.05,y+0.05,&#039;%.2f&#039; %y, ha=&#039;center&#039;,va=&#039;bottom&#039;)
clo_list = [&#039;bodyType&#039;,&#039;fuelType&#039;,&#039;gearbox&#039;,&#039;notRepairedDamage&#039;]
i = 1
fig = plt.figure(figsize=(8,8))
for col in clo_list:
    plt.subplot(2,2,i)
    plot_discrete_bar(concat_data[col])
    i = i + 1

数据挖掘入门——以二手车价格预测为例-Mo 动态

对类别较少的特征采用one-hot编码,编码后特征由31个变为50个。

one_hot_list = [&#039;gearbox&#039;,&#039;notRepairedDamage&#039;,&#039;bodyType&#039;,&#039;fuelType&#039;]
for col in one_hot_list:
    one_hot = pd.get_dummies(concat_data[col])
    one_hot.columns = [col+&#039;_&#039;+str(i) for i in range(len(one_hot.columns))]
    concat_data = pd.concat([concat_data,one_hot],axis=1)

这里发现seller和offerType虽然应该是二分类的取值,但是分布情况都偏向于一种取值结果,可以直接删掉

concat_data[&#039;seller&#039;].value_counts()

数据挖掘入门——以二手车价格预测为例-Mo 动态

concat_data[&#039;offerType&#039;].value_counts()

数据挖掘入门——以二手车价格预测为例-Mo 动态

concat_data.drop([&#039;offerType&#039;,&#039;seller&#039;],axis=1,inplace=True)

对于匿名变量来说,希望能更多的使用到这里的数值信息,通过选取若干个非匿名变量和匿名变量进行加法和乘法的数值操作,扩展数据的特征

for i in [&#039;v_&#039; +str(t) for t in range(14)]:
    for j in [&#039;v_&#039; +str(k) for k in range(int(i[2:])+1,15)]:
        concat_data[str(i)+&#039;+&#039;+str(j)] = concat_data[str(i)]+concat_data[str(j)]

for i in [&#039;model&#039;,&#039;brand&#039;, &#039;bodyType&#039;, &#039;fuelType&#039;,&#039;gearbox&#039;, &#039;power&#039;, &#039;kilometer&#039;, &#039;notRepairedDamage&#039;, &#039;regionCode&#039;]:
    for j in [&#039;v_&#039; +str(i) for i in range(14)]:
        concat_data[str(i)+&#039;*&#039;+str(j)] = concat_data[i]*concat_data[j]    
concat_data.shape

数据挖掘入门——以二手车价格预测为例-Mo 动态

对日期数据处理

日期数据同样是很重要的数据,有着具体的实际含义。这里我们先提取出日期中的年月日,再具体分析每个日期的数据。

# 设置日期的格式,例如20160404,设为2016-04-04,其中月份从1-12
def date_proc(x):
    m = int(x[4:6])
    if m == 0:
        m = 1
    return x[:4] + &#039;-&#039; + str(m) + &#039;-&#039; + x[6:]
#定义日期提取函数
def date_transform(df,fea_col):
    for f in tqdm(fea_col):
        df[f] = pd.to_datetime(df[f].astype(&#039;str&#039;).apply(date_proc))
        df[f + &#039;_year&#039;] = df[f].dt.year
        df[f + &#039;_month&#039;] = df[f].dt.month
        df[f + &#039;_day&#039;] = df[f].dt.day
        df[f + &#039;_dayofweek&#039;] = df[f].dt.dayofweek
    return (df)
#提取日期信息
date_cols = [&#039;regDate&#039;, &#039;creatDate&#039;]
concat_data = date_transform(concat_data,date_cols)

数据挖掘入门——以二手车价格预测为例-Mo 动态
继续使用日期的数据,构造其他特征。分析var=data['creatDate'] - data['regDate']的含义,var代表了车辆注册日期和创建交易的日期之间相差的天数,可以从侧面反映出汽车使用的时长,一般来说价格与使用时间成反比
不过要注意,数据里有时间出错的格式,所以我们需要 errors='coerce'

data = concat_data.copy()
# 统计使用天数
data[&#039;used_time1&#039;] = (pd.to_datetime(data[&#039;creatDate&#039;], format=&#039;%Y%m%d&#039;, errors=&#039;coerce&#039;) - 
                            pd.to_datetime(data[&#039;regDate&#039;], format=&#039;%Y%m%d&#039;, errors=&#039;coerce&#039;)).dt.days
data[&#039;used_time2&#039;] = (pd.datetime.now() - pd.to_datetime(data[&#039;regDate&#039;], format=&#039;%Y%m%d&#039;, errors=&#039;coerce&#039;)).dt.days                        
data[&#039;used_time3&#039;] = (pd.datetime.now() - pd.to_datetime(data[&#039;creatDate&#039;], format=&#039;%Y%m%d&#039;, errors=&#039;coerce&#039;) ).dt.days
#分桶操作,划分到区间内
def cut_group(df,cols,num_bins=50):
    for col in cols:
        all_range = int(df[col].max()-df[col].min())
#         print(all_range)
        bin = [i*all_range/num_bins for i in range(all_range)]
        df[col+&#039;_bin&#039;] = pd.cut(df[col], bin, labels=False)   # 使用cut方法进行分箱
    return df

#分桶操作
cut_cols = [&#039;used_time1&#039;,&#039;used_time2&#039;,&#039;used_time3&#039;]
data = cut_group(data,cut_cols,50)

#分桶操作
data = cut_group(data,[&#039;kilometer&#039;],10)

对年份和月份进行处理,继续使用独热编码

data[&#039;creatDate_year&#039;].value_counts()

数据挖掘入门——以二手车价格预测为例-Mo 动态

data[&#039;creatDate_month&#039;].value_counts()
# data[&#039;regDate_year&#039;].value_counts()

数据挖掘入门——以二手车价格预测为例-Mo 动态

# 对类别较少的特征采用one-hot编码
one_hot_list = [&#039;creatDate_year&#039;,&#039;creatDate_month&#039;,&#039;regDate_month&#039;,&#039;regDate_year&#039;]
for col in one_hot_list:
    one_hot = pd.get_dummies(data[col])
    one_hot.columns = [col+&#039;_&#039;+str(i) for i in range(len(one_hot.columns))]
    data = pd.concat([data,one_hot],axis=1)
# 删除无用的SaleID
data.drop([&#039;SaleID&#039;],axis=1,inplace=True)

增加特征数量

增加特征的数量可以从数理统计的角度出发,通过选取一些变量的数理特性来增加数据维度

# count编码
def count_coding(df,fea_col):
    for f in fea_col:
        df[f + &#039;_count&#039;] = df[f].map(df[f].value_counts())
    return(df)
#count编码
count_list = [&#039;model&#039;, &#039;brand&#039;, &#039;regionCode&#039;,&#039;bodyType&#039;,&#039;fuelType&#039;,&#039;name&#039;,&#039;regDate_year&#039;, &#039;regDate_month&#039;, &#039;regDate_day&#039;,
       &#039;regDate_dayofweek&#039; , &#039;creatDate_month&#039;,&#039;creatDate_day&#039;, &#039;creatDate_dayofweek&#039;,&#039;kilometer&#039;]
data = count_coding(data,count_list)

绘制热力图来分析匿名变量和price的相关性,其中v_0,v_8,v_12相关性较高

temp = Train_data[[&#039;v_0&#039;, &#039;v_1&#039;, &#039;v_2&#039;, &#039;v_3&#039;,
       &#039;v_4&#039;, &#039;v_5&#039;, &#039;v_6&#039;, &#039;v_7&#039;, &#039;v_8&#039;, &#039;v_9&#039;, &#039;v_10&#039;, &#039;v_11&#039;, &#039;v_12&#039;,
       &#039;v_13&#039;, &#039;v_14&#039;,&#039;price&#039;]]
# Zoomed heatmap, correlation matrix
sns.set(rc={&#039;figure.figsize&#039;:(8,6)})
correlation_matrix = temp.corr()

k = 8           #number of variables for heatmap
cols = correlation_matrix.nlargest(k, &#039;price&#039;)[&#039;price&#039;].index
cm = np.corrcoef(temp[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt=&#039;.2f&#039;, annot_kws={&#039;size&#039;: 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

数据挖掘入门——以二手车价格预测为例-Mo 动态

#定义交叉特征统计
def cross_cat_num(df,num_col,cat_col):
    for f1 in tqdm(cat_col):  # 对类别特征遍历
        g = df.groupby(f1, as_index=False)
        for f2 in tqdm(num_col):  # 对数值特征遍历
            feat = g[f2].agg({
                &#039;{}_{}_max&#039;.format(f1, f2): &#039;max&#039;, 
                &#039;{}_{}_min&#039;.format(f1, f2): &#039;min&#039;,
                &#039;{}_{}_median&#039;.format(f1, f2): &#039;median&#039;,
            })
            df = df.merge(feat, on=f1, how=&#039;left&#039;)
    return(df)
# 用数值特征对类别特征做统计刻画,挑了几个跟price相关性最高的匿名特征
cross_cat = [&#039;model&#039;, &#039;brand&#039;,&#039;regDate_year&#039;]
cross_num = [&#039;v_0&#039;,&#039;v_3&#039;, &#039;v_4&#039;, &#039;v_8&#039;, &#039;v_12&#039;,&#039;power&#039;]
data = cross_cat_num(data,cross_num,cross_cat)#一阶交叉

数据挖掘入门——以二手车价格预测为例-Mo 动态

划分数据集

## 选择特征列
numerical_cols = data.columns
feature_cols = [col for col in numerical_cols if col not in [&#039;price&#039;]]

## 提前特征列,标签列构造训练样本和测试样本
X_data = data.iloc[:len(Train_data),:][feature_cols]
Y_data = Train_data[&#039;price&#039;]
X_test  = data.iloc[len(Train_data):,:][feature_cols]
print("X_data: ",X_data.shape)
print("X_test: ",X_test.shape)

数据挖掘入门——以二手车价格预测为例-Mo 动态

平均数编码:针对高基数定性特征(类别特征)的数据预处理

定性特征的基数(cardinality)指的是这个定性特征所有可能的不同值的数量。在高基数(high cardinality)的定性特征面前,这些数据预处理的方法往往得不到令人满意的结果。

高基数定性特征的例子:IP地址、电子邮件域名、城市名、家庭住址、街道、产品号码。

主要原因:

  • LabelEncoder编码高基数定性特征,虽然只需要一列,但是每个自然数都具有不同的重要意义,对于y而言线性不可分。使用简单模型,容易欠拟合(underfit),无法完全捕获不同类别之间的区别;使用复杂模型,容易在其他地方过拟合(overfit)。
  • OneHotEncoder编码高基数定性特征,必然产生上万列的稀疏矩阵,易消耗大量内存和训练时间,除非算法本身有相关优化(例:SVM)。

因此,我们可以尝试使用平均数编码(mean encoding)的编码方法,在贝叶斯的架构下,利用所要预测的应变量(target variable),有监督地确定最适合这个定性特征的编码方式。在Kaggle的数据竞赛中,这也是一种常见的提高分数的手段。参考链接[https://blog.csdn.net/juzexia/article/details/78581462]

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold,KFold
from itertools import product
class MeanEncoder:
    def __init__(self, categorical_features, n_splits=10, target_type=&#039;classification&#039;, prior_weight_func=None):
        """
        :param categorical_features: list of str, the name of the categorical columns to encode

        :param n_splits: the number of splits used in mean encoding

        :param target_type: str, &#039;regression&#039; or &#039;classification&#039;

        :param prior_weight_func:
        a function that takes in the number of observations, and outputs prior weight
        when a dict is passed, the default exponential decay function will be used:
        k: the number of observations needed for the posterior to be weighted equally as the prior
        f: larger f --> smaller slope
        """

        self.categorical_features = categorical_features
        self.n_splits = n_splits
        self.learned_stats = {}

        if target_type == &#039;classification&#039;:
            self.target_type = target_type
            self.target_values = []
        else:
            self.target_type = &#039;regression&#039;
            self.target_values = None

        if isinstance(prior_weight_func, dict):
            self.prior_weight_func = eval(&#039;lambda x: 1 / (1 + np.exp((x - k) / f))&#039;, dict(prior_weight_func, np=np))
        elif callable(prior_weight_func):
            self.prior_weight_func = prior_weight_func
        else:
            self.prior_weight_func = lambda x: 1 / (1 + np.exp((x - 2) / 1))

    @staticmethod
    def mean_encode_subroutine(X_train, y_train, X_test, variable, target, prior_weight_func):
        X_train = X_train[[variable]].copy()
        X_test = X_test[[variable]].copy()

        if target is not None:
            nf_name = &#039;{}_pred_{}&#039;.format(variable, target)
            X_train[&#039;pred_temp&#039;] = (y_train == target).astype(int)  # classification
        else:
            nf_name = &#039;{}_pred&#039;.format(variable)
            X_train[&#039;pred_temp&#039;] = y_train  # regression
        prior = X_train[&#039;pred_temp&#039;].mean()

        col_avg_y = X_train.groupby(by=variable, axis=0)[&#039;pred_temp&#039;].agg({&#039;mean&#039;: &#039;mean&#039;, &#039;beta&#039;: &#039;size&#039;})
        col_avg_y[&#039;beta&#039;] = prior_weight_func(col_avg_y[&#039;beta&#039;])
        col_avg_y[nf_name] = col_avg_y[&#039;beta&#039;] * prior + (1 - col_avg_y[&#039;beta&#039;]) * col_avg_y[&#039;mean&#039;]
        col_avg_y.drop([&#039;beta&#039;, &#039;mean&#039;], axis=1, inplace=True)

        nf_train = X_train.join(col_avg_y, on=variable)[nf_name].values
        nf_test = X_test.join(col_avg_y, on=variable).fillna(prior, inplace=False)[nf_name].values

        return nf_train, nf_test, prior, col_avg_y

    def fit_transform(self, X, y):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :param y: pandas Series or numpy array, n_samples
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()
        if self.target_type == &#039;classification&#039;:
            skf = StratifiedKFold(self.n_splits)
        else:
            skf = KFold(self.n_splits)

        if self.target_type == &#039;classification&#039;:
            self.target_values = sorted(set(y))
            self.learned_stats = {&#039;{}_pred_{}&#039;.format(variable, target): [] for variable, target in
                                  product(self.categorical_features, self.target_values)}
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = &#039;{}_pred_{}&#039;.format(variable, target)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, target, self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        else:
            self.learned_stats = {&#039;{}_pred&#039;.format(variable): [] for variable in self.categorical_features}
            for variable in self.categorical_features:
                nf_name = &#039;{}_pred&#039;.format(variable)
                X_new.loc[:, nf_name] = np.nan
                for large_ind, small_ind in skf.split(y, y):
                    nf_large, nf_small, prior, col_avg_y = MeanEncoder.mean_encode_subroutine(
                        X_new.iloc[large_ind], y.iloc[large_ind], X_new.iloc[small_ind], variable, None, self.prior_weight_func)
                    X_new.iloc[small_ind, -1] = nf_small
                    self.learned_stats[nf_name].append((prior, col_avg_y))
        return X_new

    def transform(self, X):
        """
        :param X: pandas DataFrame, n_samples * n_features
        :return X_new: the transformed pandas DataFrame containing mean-encoded categorical features
        """
        X_new = X.copy()

        if self.target_type == &#039;classification&#039;:
            for variable, target in product(self.categorical_features, self.target_values):
                nf_name = &#039;{}_pred_{}&#039;.format(variable, target)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits
        else:
            for variable in self.categorical_features:
                nf_name = &#039;{}_pred&#039;.format(variable)
                X_new[nf_name] = 0
                for prior, col_avg_y in self.learned_stats[nf_name]:
                    X_new[nf_name] += X_new[[variable]].join(col_avg_y, on=variable).fillna(prior, inplace=False)[
                        nf_name]
                X_new[nf_name] /= self.n_splits

        return X_new
# 高基数定性特征:name汽车交易名称,brand汽车品牌,regionCode地区编码
class_list = [&#039;model&#039;,&#039;brand&#039;,&#039;name&#039;,&#039;regionCode&#039;]+date_cols  # date_cols = [&#039;regDate&#039;, &#039;creatDate&#039;]
MeanEnocodeFeature = class_list   # 声明需要平均数编码的特征
ME = MeanEncoder(MeanEnocodeFeature,target_type=&#039;regression&#039;) # 声明平均数编码的类
X_data = ME.fit_transform(X_data,Y_data)   # 对训练数据集的X和y进行拟合
X_test = ME.transform(X_test)#对测试集进行编码
X_data[&#039;price&#039;] = Train_data[&#039;price&#039;]
from sklearn.model_selection import KFold
# target encoding目标编码,回归场景相对来说做目标编码的选择更多,不仅可以做均值编码,还可以做标准差编码、中位数编码等
enc_cols = []
stats_default_dict = {
    &#039;max&#039;: X_data[&#039;price&#039;].max(),
    &#039;min&#039;: X_data[&#039;price&#039;].min(),
    &#039;median&#039;: X_data[&#039;price&#039;].median(),
    &#039;mean&#039;: X_data[&#039;price&#039;].mean(),
    &#039;sum&#039;: X_data[&#039;price&#039;].sum(),
    &#039;std&#039;: X_data[&#039;price&#039;].std(),
    &#039;skew&#039;: X_data[&#039;price&#039;].skew(),
    &#039;kurt&#039;: X_data[&#039;price&#039;].kurt(),
    &#039;mad&#039;: X_data[&#039;price&#039;].mad()
}
### 暂且选择这三种编码
enc_stats = [&#039;max&#039;,&#039;min&#039;,&#039;mean&#039;]
skf = KFold(n_splits=10, shuffle=True, random_state=42)
for f in tqdm([&#039;regionCode&#039;,&#039;brand&#039;,&#039;regDate_year&#039;,&#039;creatDate_year&#039;,&#039;kilometer&#039;,&#039;model&#039;]):
    enc_dict = {}
    for stat in enc_stats:
        enc_dict[&#039;{}_target_{}&#039;.format(f, stat)] = stat
        X_data[&#039;{}_target_{}&#039;.format(f, stat)] = 0
        X_test[&#039;{}_target_{}&#039;.format(f, stat)] = 0
        enc_cols.append(&#039;{}_target_{}&#039;.format(f, stat))
    for i, (trn_idx, val_idx) in enumerate(skf.split(X_data, Y_data)):
        trn_x, val_x = X_data.iloc[trn_idx].reset_index(drop=True), X_data.iloc[val_idx].reset_index(drop=True)
        enc_df = trn_x.groupby(f, as_index=False)[&#039;price&#039;].agg(enc_dict)
        val_x = val_x[[f]].merge(enc_df, on=f, how=&#039;left&#039;)
        test_x = X_test[[f]].merge(enc_df, on=f, how=&#039;left&#039;)
        for stat in enc_stats:
            val_x[&#039;{}_target_{}&#039;.format(f, stat)] = val_x[&#039;{}_target_{}&#039;.format(f, stat)].fillna(stats_default_dict[stat])
            test_x[&#039;{}_target_{}&#039;.format(f, stat)] = test_x[&#039;{}_target_{}&#039;.format(f, stat)].fillna(stats_default_dict[stat])
            X_data.loc[val_idx, &#039;{}_target_{}&#039;.format(f, stat)] = val_x[&#039;{}_target_{}&#039;.format(f, stat)].values 
            X_test[&#039;{}_target_{}&#039;.format(f, stat)] += test_x[&#039;{}_target_{}&#039;.format(f, stat)].values / skf.n_splits

数据挖掘入门——以二手车价格预测为例-Mo 动态

drop_list = [&#039;regDate&#039;, &#039;creatDate&#039;,&#039;brand_power_min&#039;, &#039;regDate_year_power_min&#039;]
x_train = X_data.drop(drop_list+[&#039;price&#039;],axis=1)
x_test = X_test.drop(drop_list,axis=1)
x_train.shape

数据挖掘入门——以二手车价格预测为例-Mo 动态

x_train = x_train.astype(&#039;float32&#039;)
x_test = x_test.astype(&#039;float32&#039;)

使用MinMaxScaler处理数据,再使用PCA降低维度

from sklearn.preprocessing import MinMaxScaler
#特征归一化
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(pd.concat([x_train,x_test]).values)
all_data = min_max_scaler.transform(pd.concat([x_train,x_test]).values)
print(all_data.shape)
from sklearn import decomposition
pca = decomposition.PCA(n_components=400)
all_pca = pca.fit_transform(all_data)
X_pca = all_pca[:len(x_train)]
test = all_pca[len(x_train):]
y = Train_data[&#039;price&#039;].values
print(all_pca.shape)

数据挖掘入门——以二手车价格预测为例-Mo 动态

模型选择

这里以keras搭建基础的神经网络模型,模型结构选取全连接神经网络。
数据挖掘入门——以二手车价格预测为例-Mo 动态

from keras.layers import Conv1D, Activation, MaxPool1D, Flatten, Dense
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout, merge, Add
def NN_model(input_dim):
    init = keras.initializers.glorot_uniform(seed=1)
    model = keras.models.Sequential()
    model.add(Dense(units=300, input_dim=input_dim, kernel_initializer=init, activation=&#039;softplus&#039;))
    #model.add(Dropout(0.2))
    model.add(Dense(units=300, kernel_initializer=init, activation=&#039;softplus&#039;))
    #model.add(Dropout(0.2))
    model.add(Dense(units=64, kernel_initializer=init, activation=&#039;softplus&#039;))
    model.add(Dense(units=32, kernel_initializer=init, activation=&#039;softplus&#039;))
    model.add(Dense(units=8, kernel_initializer=init, activation=&#039;softplus&#039;))
    model.add(Dense(units=1))
    return model
from keras.callbacks import Callback, EarlyStopping
class Metric(Callback):
    def __init__(self, model, callbacks, data):
        super().__init__()
        self.model = model
        self.callbacks = callbacks
        self.data = data

    def on_train_begin(self, logs=None):
        for callback in self.callbacks:
            callback.on_train_begin(logs)

    def on_train_end(self, logs=None):
        for callback in self.callbacks:
            callback.on_train_end(logs)

    def on_epoch_end(self, batch, logs=None):
        X_train, y_train = self.data[0][0], self.data[0][1]
        y_pred3 = self.model.predict(X_train)
        y_pred = np.zeros((len(y_pred3), ))
        y_true = np.zeros((len(y_pred3), ))
        for i in range(len(y_pred3)):
            y_pred[i] = y_pred3[i]
        for i in range(len(y_pred3)):
            y_true[i] = y_train[i]
        trn_s = mean_absolute_error(y_true, y_pred)
        logs[&#039;trn_score&#039;] = trn_s

        X_val, y_val = self.data[1][0], self.data[1][1]
        y_pred3 = self.model.predict(X_val)
        y_pred = np.zeros((len(y_pred3), ))
        y_true = np.zeros((len(y_pred3), ))
        for i in range(len(y_pred3)):
            y_pred[i] = y_pred3[i]
        for i in range(len(y_pred3)):
            y_true[i] = y_val[i]
        val_s = mean_absolute_error(y_true, y_pred)
        logs[&#039;val_score&#039;] = val_s
        print(&#039;trn_score&#039;, trn_s, &#039;val_score&#039;, val_s)

        for callback in self.callbacks:
            callback.on_epoch_end(batch, logs)
import keras.backend as K
from keras.callbacks import LearningRateScheduler

def scheduler(epoch):
    # 每隔20个epoch,学习率减小为原来的0.5
    if epoch % 20 == 0 and epoch != 0:
        lr = K.get_value(model.optimizer.lr)
        K.set_value(model.optimizer.lr, lr * 0.5)
        print("lr changed to {}".format(lr * 0.5))
    return K.get_value(model.optimizer.lr)
reduce_lr = LearningRateScheduler(scheduler)
#model.fit(train_x, train_y, batch_size=32, epochs=5, callbacks=[reduce_lr])
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)

import keras 

b_size = 2000
max_epochs = 145
oof_pred = np.zeros((len(X_pca), ))

sub = pd.read_csv(&#039;used_car_testB_20200421.csv&#039;,sep = &#039; &#039;)[[&#039;SaleID&#039;]].copy()
sub[&#039;price&#039;] = 0

avg_mae = 0
for fold, (trn_idx, val_idx) in enumerate(kf.split(X_pca, y)):
    print(&#039;fold:&#039;, fold)
    X_train, y_train = X_pca[trn_idx], y[trn_idx]
    X_val, y_val = X_pca[val_idx], y[val_idx]

    model = NN_model(X_train.shape[1])
    simple_adam = keras.optimizers.Adam(lr = 0.01)

    model.compile(loss=&#039;mae&#039;, optimizer=simple_adam,metrics=[&#039;mae&#039;])
    es = EarlyStopping(monitor=&#039;val_score&#039;, patience=10, verbose=0, mode=&#039;min&#039;, restore_best_weights=True,)
    es.set_model(model)
    metric = Metric(model, [es], [(X_train, y_train), (X_val, y_val)])
    model.fit(X_train, y_train, batch_size=b_size, epochs=max_epochs, 
              validation_data = [X_val, y_val],
              callbacks=[reduce_lr], shuffle=True, verbose=0)
    y_pred3 = model.predict(X_val)
    y_pred = np.zeros((len(y_pred3), ))
    sub[&#039;price&#039;] += model.predict(test).reshape(-1,)/n_splits
    for i in range(len(y_pred3)):
        y_pred[i] = y_pred3[i]

    oof_pred[val_idx] = y_pred
    val_mae = mean_absolute_error(y[val_idx], y_pred)
    avg_mae += val_mae/n_splits
    print()
    print(&#039;val_mae is:{}&#039;.format(val_mae))
    print()
mean_absolute_error(y, oof_pred)

数据挖掘入门——以二手车价格预测为例-Mo 动态