Description
👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place.
This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial, which walks you through, step by step, how to make your first submission!
1. Train the model
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# 1. Read the training data
train_df = pd.read_csv('titanic/train.csv')  # adjust the path if train.csv lives elsewhere
# 2. Preprocess the data
# Map sex to a number: male -> 0, female -> 1
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
# Fill missing Age and Fare values with the median
# (assign the result back; inplace=True on a selected column triggers warnings in newer pandas)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Fare'] = train_df['Fare'].fillna(train_df['Fare'].median())
# Fill missing Embarked values, then one-hot encode the column
train_df['Embarked'] = train_df['Embarked'].fillna('S')
embarked_dummies = pd.get_dummies(train_df['Embarked'], prefix='Embarked')
train_df = pd.concat([train_df, embarked_dummies], axis=1)
# 3. Select the feature columns (extend as needed)
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
                'Embarked_C', 'Embarked_Q', 'Embarked_S']
X = train_df[feature_cols]
y = train_df['Survived']
# 4. Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print("Model training complete!")
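The script above fits the model on all of the training data, so it gives no estimate of how well the model generalizes. One common way to get such an estimate is k-fold cross-validation. The sketch below is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins (not Titanic data) so the snippet runs without the CSV files, and the toy label rule is an assumption made up for the demo. With the real data you would pass the `X` and `y` built above instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features (hypothetical data;
# replace X_demo / y_demo with the real X and y from the script above).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),   # 1, 2 or 3
    'Sex': rng.integers(0, 2, n),      # 0 = male, 1 = female
    'Age': rng.uniform(1, 80, n),
    'SibSp': rng.integers(0, 4, n),
    'Parch': rng.integers(0, 3, n),
    'Fare': rng.uniform(5, 250, n),
})
# Toy label tied to Sex and Pclass so the model has a signal to learn.
y_demo = ((X_demo['Sex'] == 1) | (X_demo['Pclass'] == 1)).astype(int)

# Same model settings as in the script above.
model_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: train on 4 folds, score accuracy on the held-out fold.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

The mean of the fold accuracies is a rough preview of the leaderboard score; on the synthetic data it is high only because the label is a simple function of two features.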
2. Load the test set and predict
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# ==========================
# 1. Read the training data and train the model
# ==========================
train_df = pd.read_csv('titanic/train.csv')
# Map sex to a number
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
# Handle missing values (assign back rather than using inplace=True on a column)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
train_df['Fare'] = train_df['Fare'].fillna(train_df['Fare'].median())
train_df['Embarked'] = train_df['Embarked'].fillna('S')
# One-hot encode Embarked
embarked_dummies = pd.get_dummies(train_df['Embarked'], prefix='Embarked')
train_df = pd.concat([train_df, embarked_dummies], axis=1)
# Select the features
feature_cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
                'Embarked_C', 'Embarked_Q', 'Embarked_S']
X = train_df[feature_cols]
y = train_df['Survived']
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
print("✅ Model training complete")
# ==========================
# 2. Load the test data and make predictions
# ==========================
test_df = pd.read_csv('titanic/test.csv')
# Apply the same preprocessing as for the training data
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})
# Fill with the training-set medians, which is more robust than using test-set statistics
test_df['Age'] = test_df['Age'].fillna(train_df['Age'].median())
test_df['Fare'] = test_df['Fare'].fillna(train_df['Fare'].median())
test_df['Embarked'] = test_df['Embarked'].fillna('S')
# One-hot encode Embarked
embarked_dummies_test = pd.get_dummies(test_df['Embarked'], prefix='Embarked')
# Make sure the test set also has all three columns (a category may be absent)
for col in ['Embarked_C', 'Embarked_Q', 'Embarked_S']:
    if col not in embarked_dummies_test.columns:
        embarked_dummies_test[col] = 0
test_df = pd.concat([test_df, embarked_dummies_test], axis=1)
# Select the features in the same column order used for training
X_test = test_df[feature_cols]
# Predict
predictions = model.predict(X_test)
# ==========================
# 3. Generate the submission file
# ==========================
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predictions
})
submission.to_csv('submission.csv', index=False)
print("✅ Prediction complete; submission file saved as submission.csv")
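The script repeats the same preprocessing steps for the training and test sets. One way to keep the two in sync is to move those steps into a single function and pass in the training-set medians. The sketch below is an assumption about how that could look, not part of the original script: `preprocess` and `EMBARKED_COLS` are hypothetical names, the tiny frames are made-up rows, and `reindex` is used in place of the loop that adds missing dummy columns.

```python
import pandas as pd

# Dummy columns the model expects (hypothetical constant name).
EMBARKED_COLS = ['Embarked_C', 'Embarked_Q', 'Embarked_S']

def preprocess(df, age_median, fare_median):
    """Return a copy of df with the same transformations the script applies."""
    out = df.copy()
    out['Sex'] = out['Sex'].map({'male': 0, 'female': 1})
    out['Age'] = out['Age'].fillna(age_median)
    out['Fare'] = out['Fare'].fillna(fare_median)
    out['Embarked'] = out['Embarked'].fillna('S')
    dummies = pd.get_dummies(out['Embarked'], prefix='Embarked', dtype=int)
    # reindex adds any missing category as a column of zeros and fixes the order.
    dummies = dummies.reindex(columns=EMBARKED_COLS, fill_value=0)
    return pd.concat([out, dummies], axis=1)

# Made-up rows standing in for train.csv and test.csv.
train_demo = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 30.0],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', 'C', None],
})
test_demo = pd.DataFrame({
    'Sex': ['female', 'male'],
    'Age': [None, 40.0],
    'Fare': [None, 13.0],
    'Embarked': ['Q', 'Q'],
})

# Medians come from the training data only and are reused for the test data.
age_median = train_demo['Age'].median()
fare_median = train_demo['Fare'].median()

train_ready = preprocess(train_demo, age_median, fare_median)
test_ready = preprocess(test_demo, age_median, fare_median)
print(test_ready[['Sex', 'Age', 'Fare'] + EMBARKED_COLS])
```

Because the test rows here only contain `Q`, the `Embarked_C` and `Embarked_S` columns come out as zeros, which is the case the loop in the script guards against.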
3. Submit the code
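Before uploading, it can help to check that the file has the shape the script is meant to produce: a `PassengerId` column and a `Survived` column, one row per test passenger, and only 0 or 1 as predictions. The helper below is a hypothetical sketch (`check_submission` is not a Kaggle or pandas function), and the IDs in the demo are made-up examples.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if list(submission.columns) != ['PassengerId', 'Survived']:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids):
        problems.append(f"expected {len(expected_ids)} rows, got {len(submission)}")
    if set(submission['PassengerId']) != set(expected_ids):
        problems.append("PassengerId values do not match the test set")
    if not submission['Survived'].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Demo with made-up IDs: one well-formed frame and one with an invalid label.
good = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
bad = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 2, 0]})
good_problems = check_submission(good, [892, 893, 894])
bad_problems = check_submission(bad, [892, 893, 894])
print(good_problems)
print(bad_problems)
```

With the real files, the call would be something like `check_submission(pd.read_csv('submission.csv'), test_df['PassengerId'])`, using the `test_df` loaded in step 2.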