Machine Learning - Kaggle Project Practice (2): House Price Prediction

Published: 2025-08-15

Original competition: House Prices - Advanced Regression Techniques | Kaggle

The link above is the competition page. The next two links are reference walkthroughs, and the two after them are my own Kaggle notebooks: a TF-DF version and a multi-model ensemble version.

Reference 1 (TF-DF): House Prices Prediction using TFDF | Kaggle

Reference 2 (ensemble learning): Stacked Regressions : Top 4% on LeaderBoard | Kaggle

My notebook for the first two parts: House Prices | Kaggle

Ensemble version: House Prices-2 | Kaggle

1. Data Analysis

First, load the data and list the columns.

import pandas as pd
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_train.columns

Start with an initial look at the target, SalePrice: describe(), a distplot, and its skewness and kurtosis.

import seaborn as sns
print(df_train['SalePrice'].describe())
sns.distplot(df_train['SalePrice'])  # histogram plus fitted density curve (deprecated in newer seaborn; histplot(..., kde=True) is the modern equivalent)
print("Skewness: %f" % df_train['SalePrice'].skew())  # skewness: asymmetry of the distribution
print("Kurtosis: %f" % df_train['SalePrice'].kurt())  # kurtosis: how peaked / heavy-tailed it is
df_train = df_train.drop('Id', axis=1)
df_train.info()

Check which dtypes the table contains, extract the numeric columns, and plot their histograms (hist).

print(list(set(df_train.dtypes.tolist())))  # floats, integers, and object (string) columns
df_num = df_train.select_dtypes(include = ['float64', 'int64'])  # keep only the numeric columns
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);  # histograms of all numeric columns

We can also pick two features and plot them against SalePrice as scatter plots to see a rough positive correlation.

import matplotlib.pyplot as plt
import pandas as pd

# set up the figure
plt.figure(figsize=(12, 6))

# first subplot: SalePrice vs GrLivArea
var1 = 'GrLivArea'
data1 = pd.concat([df_train['SalePrice'], df_train[var1]], axis=1)
plt.subplot(1, 2, 1)  # first subplot of a 1x2 grid
data1.plot.scatter(x=var1, y='SalePrice', ylim=(0, 800000), ax=plt.gca())
plt.title(f'SalePrice vs {var1}')  # add a title

# second subplot: SalePrice vs TotalBsmtSF
var2 = 'TotalBsmtSF'
data2 = pd.concat([df_train['SalePrice'], df_train[var2]], axis=1)
plt.subplot(1, 2, 2)  # second subplot of the 1x2 grid
data2.plot.scatter(x=var2, y='SalePrice', ylim=(0, 800000), ax=plt.gca())
plt.title(f'SalePrice vs {var2}')  # add a title

# show the figure
plt.tight_layout()  # adjust spacing between subplots
plt.show()

Two box plots showing how overall quality and year built relate to SalePrice.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# set up the figure
plt.figure(figsize=(16, 8))

# first subplot: OverallQual vs SalePrice
var1 = 'OverallQual'
data1 = pd.concat([df_train['SalePrice'], df_train[var1]], axis=1)
plt.subplot(1, 2, 1)  # first subplot of a 1x2 grid
sns.boxplot(x=var1, y="SalePrice", data=data1)
plt.ylim(0, 800000)  # set the y-axis range
plt.title(f'SalePrice vs {var1}')  # add a title

# second subplot: YearBuilt vs SalePrice
var2 = 'YearBuilt'
data2 = pd.concat([df_train['SalePrice'], df_train[var2]], axis=1)
plt.subplot(1, 2, 2)  # second subplot of the 1x2 grid
sns.boxplot(x=var2, y="SalePrice", data=data2)
plt.ylim(0, 800000)  # set the y-axis range
plt.xticks(rotation=90)  # rotate the x-axis labels by 90 degrees
plt.title(f'SalePrice vs {var2}')  # add a title

# show the figure
plt.tight_layout()  # adjust spacing between subplots
plt.show()

Plot a correlation heatmap of the numeric variables, plus a second heatmap of the ten variables most correlated with SalePrice.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# keep only the numeric columns
df_train_numeric = df_train.select_dtypes(include=[np.number])

# set up the figure
plt.figure(figsize=(18, 9))

# first subplot: correlation heatmap of all numeric variables
corrmat = df_train_numeric.corr()  # correlation matrix
plt.subplot(1, 2, 1)  # first subplot of a 1x2 grid
sns.heatmap(corrmat, vmax=.8, square=True)  # draw the heatmap
plt.title('Correlation Matrix')  # add a title

# second subplot: heatmap of the 10 variables most correlated with SalePrice
cols = corrmat.nlargest(10, 'SalePrice')['SalePrice'].index  # pick the 10 variables most correlated with SalePrice
cm = np.corrcoef(df_train_numeric[cols].values.T)  # correlation matrix of just these 10 variables
plt.subplot(1, 2, 2)  # second subplot of the 1x2 grid
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)  # annotated heatmap
plt.title('Top 10 Correlated with SalePrice')  # add a title

# show the figure
plt.tight_layout()  # adjust spacing between subplots
plt.show()

2. TF-DF (TensorFlow Decision Forests)

House Prices Prediction using TFDF | Kaggle

No preprocessing at all: we call the TensorFlow Decision Forests library directly on the raw data.

2.1 Split the data into training and validation sets, and convert the Pandas DataFrames into the TensorFlow dataset format that TF-DF expects for a regression task.
Then create a Random Forest model, compile it with MSE as the evaluation metric, train it, and visualize one of its trees.

import pandas as pd
import numpy as np
import tensorflow_decision_forests as tfdf
from sklearn.model_selection import train_test_split

# df_train was loaded and cleaned in the data analysis section above

# split the data into training and validation sets
train_ds_pd, valid_ds_pd = train_test_split(df_train, test_size=0.30, random_state=42)
label = 'SalePrice'  # target variable

# convert both splits to the TensorFlow dataset format that TF-DF expects for regression
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)

# create a Random Forest model for regression
rf = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)

# compile the model with mean squared error (MSE) as the evaluation metric
rf.compile(metrics=["mse"])

# train the model
print("Starting model training...")
rf.fit(x=train_ds)
print("Model training complete.")

# visualize one of the trees in the trained forest
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

2.2 We can also plot RMSE against the number of trees and inspect the (out-of-bag) training evaluation.

import matplotlib.pyplot as plt
inspector = rf.make_inspector()
logs = inspector.training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()
inspector.evaluation()

2.3 Print the validation loss and MSE, and read feature importance from how often each feature is used as a root split.

evaluation = rf.evaluate(x=valid_ds, return_dict=True)
for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")
print(f"Available variable importances:")
for importance in inspector.variable_importances().keys():
    print("\t", importance)
inspector.variable_importances()["NUM_AS_ROOT"]

2.4 Predict and save the submission

test_data = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
ids = test_data.pop('Id')

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, task=tfdf.keras.Task.REGRESSION)
preds = rf.predict(test_ds)
output = pd.DataFrame({'Id': ids, 'SalePrice': preds.squeeze()})  # squeeze the (n, 1) prediction array to 1-D

sample_submission_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')
sample_submission_df['SalePrice'] = preds.squeeze()  # reuse the predictions instead of calling predict() again
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)

3. Multi-Model Ensemble

Stacked Regressions : Top 4% on LeaderBoard | Kaggle

House Prices-2 | Kaggle

3.1 Processing the target SalePrice

Plot a scatter of GrLivArea vs. SalePrice. The points in the lower-right corner contradict the overall trend, so we drop them as outliers.

import matplotlib.pyplot as plt
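# NOTE (assumption): train/test here are the raw train.csv / test.csv DataFrames with the Id
# column dropped, as loaded in the reference kernel; they are separate from df_train in section 1.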
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

A normal-fit plot and a QQ plot show that SalePrice is right-skewed; after a log1p transform it is much closer to normal.

import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew 
import numpy as np
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'], fit=norm)  # plot the distribution and fit a normal curve

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

3.2 Handling missing values

1. First concatenate train and test into all_data (a minimal sketch follows, based on the reference kernel).
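
A minimal sketch of this step, following the reference kernel's approach; ntrain and y_train are kept so we can split all_data back apart and train later:

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values  # the (log1p-transformed) target from section 3.1

all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is: {}".format(all_data.shape))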

2. Pull out the columns that have missing values and group them for inspection.

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100  # missing ratio per column
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)  # keep only columns with missing values, sorted descending
print(all_data_na)

Handle them by category (a hedged sketch of these fills follows the list):
1) Where a missing value means the feature is absent (garage, pool, and the like), fill with None / 0.
2) Fill a few categorical columns with their mode.
3) LotFrontage: nearby houses should have similar frontage, so fill with the median of the neighborhood.
4) Utilities is nearly identical for every row and carries no predictive signal, so drop the column.
5) Functional: the data description says a missing value means Typ, so fill with Typ.
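
A minimal sketch of these fills on all_data with pandas; the exact column groupings below are an assumption that only roughly follows the reference kernel:

# 1) missing means the feature is absent: categoricals -> 'None', areas/counts -> 0
for col in ('PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'GarageType',
            'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
            'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2',
            'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'MasVnrArea'):
    all_data[col] = all_data[col].fillna(0)

# 2) fill a few categorical columns with their mode
for col in ('MSZoning', 'Electrical', 'KitchenQual', 'Exterior1st', 'Exterior2nd', 'SaleType'):
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

# 3) LotFrontage: use the median frontage of the same neighborhood
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))

# 4) Utilities is nearly constant, so it carries no signal -> drop it
all_data = all_data.drop(['Utilities'], axis=1)

# 5) the data description says a missing Functional value means 'Typ'
all_data['Functional'] = all_data['Functional'].fillna('Typ')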

3. Convert numeric columns that are really categorical (such as year and month sold) to str.

4. Ordinal-encode the ordered categories, then one-hot encode the remaining categorical columns.

5. Add a TotalSF feature: the sum of the basement, 1st-floor, and 2nd-floor areas.

6. Compute the skewness of the numeric features and apply a boxcox1p transform to the highly skewed ones (see the sketch of steps 3-6 after this list).
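
A hedged sketch of steps 3-6; the list of ordinal columns is abbreviated, and one-hot encoding is deliberately done last so the 0/1 dummy columns are not Box-Cox transformed:

from sklearn.preprocessing import LabelEncoder
from scipy.special import boxcox1p
from scipy.stats import skew

# 3) numeric columns that are really categorical -> str
for col in ('MSSubClass', 'OverallCond', 'YrSold', 'MoSold'):
    all_data[col] = all_data[col].astype(str)

# 4a) ordinal-encode some ordered categorical columns (abbreviated list)
for col in ('ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual', 'YrSold', 'MoSold'):
    all_data[col] = LabelEncoder().fit_transform(all_data[col].values)

# 5) new feature: total area = basement + 1st floor + 2nd floor
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

# 6) boxcox1p-transform the highly skewed numeric features
numeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
skewness = all_data[numeric_feats].apply(lambda x: skew(x.dropna()))
for feat in skewness[abs(skewness) > 0.75].index:
    all_data[feat] = boxcox1p(all_data[feat], 0.15)  # fixed lambda of 0.15, as in the reference kernel

# 4b) one-hot encode the remaining categorical columns, then split back for modelling
all_data = pd.get_dummies(all_data)
train = all_data[:ntrain]
test = all_data[ntrain:]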

3.3 Model training

Multi-model ensembling: evaluate with k-fold cross-validation and simply average the predictions of (ENet, GBoost, KRR, lasso).

from sklearn.model_selection import KFold, cross_val_score
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone
import numpy as np

n_folds = 5
def rmsle_cv(model):
    # pass the KFold object itself as cv; calling .get_n_splits() here would return an int
    # and silently discard the shuffle / random_state settings
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)  

averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

We can go further and use meta-learning (stacking) to learn a better combination than a plain average.

Three base models, each trained over 5 folds (n models x k folds), produce out-of-fold predictions for every training sample; stacked together these form an OOF matrix with one column per base model.

The meta-model (LASSO here) learns how to combine the OOF columns to best approximate y.

fit() stores the trained base models (an n x k grid of fitted clones in base_models_) together with the trained meta-model (meta_model_).

At predict() time, the n x k fitted clones first predict on the new data;

each base model's k fold-predictions are then averaged into a single column, and meta_model_ combines these n meta-features into the final prediction.

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))  # allocate the OOF prediction matrix
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred  # store this fold's hold-out predictions in the OOF matrix
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Finally we take an overall weighted average: the stacked averaged model gets the largest weight (70%) because it already fuses several models and generalizes more stably, while LightGBM and XGBoost get 15% each as complements that fill in the gaps.

The final predictions must be passed through expm1 to undo the earlier log1p transform.

from sklearn.metrics import mean_squared_error

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

# Stacking Averaged models
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))

# XGBoost
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

# LGBM
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))

# weighted average on the training set (70% stacking, 15% XGBoost, 15% LightGBM)
print('RMSLE score on train data:')
print(rmsle(y_train, stacked_train_pred*0.70 +
            xgb_train_pred*0.15 + lgb_train_pred*0.15))
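
For the submission itself, the same 0.70/0.15/0.15 weights are applied to the expm1-transformed test predictions computed above. A minimal sketch, where test_ID is assumed to be the Id column saved from test.csv before preprocessing:

# combine the test-set predictions with the same weights used on the training set
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

# test_ID is assumed to hold the Id column saved from test.csv before it was dropped
sub = pd.DataFrame({'Id': test_ID, 'SalePrice': ensemble})
sub.to_csv('submission.csv', index=False)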

