【动手学深度学习】4.10 实战Kaggle比赛:预测房价

发布于:2025-07-11 ⋅ 阅读:(17) ⋅ 点赞:(0)


4.10 实战Kaggle比赛:预测房价

数据来源:Kaggle房价预测比赛

.

1)数据预处理

读取数据

import pandas as pd

train_data = pd.read_csv('../data/kaggle_house_pred_train.csv')
test_data = pd.read_csv('../data/kaggle_house_pred_test.csv')

all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

处理数值和类别特征

# 数值特征标准化
numeric_feats = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_feats] = all_features[numeric_feats].apply(
    lambda x: (x - x.mean()) / x.std()
)
all_features[numeric_feats] = all_features[numeric_feats].fillna(0)

# 类别特征独热编码
all_features = pd.get_dummies(all_features, dummy_na=True)

转换为张量

import torch

n_train = train_data.shape[0]
X = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
X_test = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
y = torch.tensor(train_data.SalePrice.values, dtype=torch.float32).reshape(-1, 1)

.

2)模型定义与训练

模型定义

from torch import nn

def get_net():
    net = nn.Sequential(nn.Linear(X.shape[1], 1))
    return net

损失函数:对数均方根误差(log RMSE)

import torch.nn.functional as F

def log_rmse(net, features, labels):
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(F.mse_loss(torch.log(clipped_preds), torch.log(labels)))
    return rmse.item()

训练函数

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):

    train_ls, test_ls = [], []
    dataset = torch.utils.data.TensorDataset(train_features, train_labels)
    train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
    
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate, weight_decay=weight_decay)
    
    for epoch in range(num_epochs):
        for X_batch, y_batch in train_iter:
            optimizer.zero_grad()
            loss = F.mse_loss(net(X_batch), y_batch)
            loss.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls

.

3)模型评估与预测

K折交叉验证

def get_k_fold_data(k, i, X, y):
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat((X_train, X_part), 0)
            y_train = torch.cat((y_train, y_part), 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, lr, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, lr, weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        print(f'Fold {i+1}, Train log rmse: {train_ls[-1]:.4f}, Valid log rmse: {valid_ls[-1]:.4f}')
    return train_l_sum / k, valid_l_sum / k

.

4)模型训练与预测提交

使用全部数据训练并预测

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train(net, train_features, train_labels, None, None, num_epochs, lr, weight_decay, batch_size)
    preds = net(test_features).detach().numpy()
    
    submission = pd.DataFrame({
        "Id": test_data.Id,
        "SalePrice": preds.flatten()
    })
    submission.to_csv('submission.csv', index=False)

.

5)示例超参数(可调)

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, X, y, num_epochs, lr, weight_decay, batch_size)
print(f'{k}-fold validation: Avg train log rmse: {train_l:.4f}, Avg valid log rmse: {valid_l:.4f}')
# 最终提交
train_and_pred(X, X_test, y, test_data, num_epochs, lr, weight_decay, batch_size)

.

总结:

  • 核心流程:数据预处理 → 建模 → K折验证 → 全数据训练 → 生成提交文件。

  • 模型简单但有效:线性回归 + 标准化 + One-Hot。

  • log_rmse 是比赛评分标准的重要转化。

.


声明:资源可能存在第三方来源,若有侵权请联系删除!


网站公告

今日签到

点亮在社区的每一天
去签到