06 - Machine Learning Practice Guide
🎯 Project Overview
Project List (by increasing difficulty)
Text Only
Beginner:
1. California housing price prediction (linear regression)
2. Iris classification (logistic regression / SVM)
Intermediate:
3. Titanic survival prediction (random forest)
4. Handwritten digit recognition (CNN)
5. Customer segmentation (K-Means)
Advanced:
6. Stock price prediction (LSTM)
7. Image classification (transfer learning)
8. News classification (NLP)
📊 Project 1: California Housing Price Prediction
Goal
Predict the median house price of California districts
Dataset
- Samples: 20,640
- Features: 8 (house age, rooms, bedrooms, population, households, income, latitude, longitude)
- Target: median house value (in units of $100,000)
Note: the Boston Housing dataset used in older tutorials has been removed from scikit-learn because it encodes a racially discriminatory feature. We use the California Housing dataset instead, which is ethically sound and equally well suited to regression.
Full Code
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
# 1. Load the data
california = fetch_california_housing()
X = california.data
y = california.target
feature_names = california.feature_names
print(f"Data shape: {X.shape}")
print(f"Feature names: {feature_names}")
# 2. Exploratory data analysis (EDA)
df = pd.DataFrame(X, columns=feature_names)
df['Price'] = y
# Inspect the data
print(df.head())
# Correlation analysis
corr = df.corr()
print("Correlation with price:")
print(corr['Price'].sort_values(ascending=False))
# Visualization: average rooms vs. price
plt.scatter(df['AveRooms'], df['Price'])
plt.xlabel('Average rooms (AveRooms)')
plt.ylabel('Price')
plt.title('Average Rooms vs. Price')
plt.show()
# 3. Preprocessing
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 4. Model training
# Linear regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
# Lasso regression (L1 regularization)
lasso = Lasso(alpha=0.01)  # alpha=1.0 is too strong here and shrinks most coefficients to zero
lasso.fit(X_train_scaled, y_train)
# 5. Prediction and evaluation
def evaluate_model(model, X_test, y_test, name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\n{name}:")
    print(f"MSE: {mse:.2f}")
    print(f"R²: {r2:.4f}")
    return y_pred

y_pred_lr = evaluate_model(lr, X_test_scaled, y_test, "Linear regression")
y_pred_ridge = evaluate_model(ridge, X_test_scaled, y_test, "Ridge regression")
y_pred_lasso = evaluate_model(lasso, X_test_scaled, y_test, "Lasso regression")
# 6. Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual price ($100k)')
plt.ylabel('Predicted price ($100k)')
plt.title('Actual vs. Predicted Prices')
plt.show()
# 7. Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lr.coef_
}).sort_values('Coefficient', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'],
         feature_importance['Coefficient'])
plt.title('Feature Importance (Regression Coefficients)')
plt.show()
Exercises
- Try different alpha values (regularization strength)
- Add polynomial features (to capture nonlinear relationships)
- Try other algorithms (random forest, XGBoost)
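As a sketch of the second exercise, here is a minimal, hedged example on synthetic data (standing in for the housing features; the coefficients and degree are arbitrary assumptions): wrap PolynomialFeatures and the scaler in a pipeline ahead of Ridge, so the quadratic structure becomes learnable by a linear model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term, standing in for the housing features
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.8 * X[:, 0] ** 2 + rng.normal(0, 0.3, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: plain Ridge on the raw features
linear = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
# Degree-2 expansion adds squares and pairwise products before the same Ridge
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     StandardScaler(),
                     Ridge(alpha=1.0)).fit(X_train, y_train)

r2_lin = r2_score(y_test, linear.predict(X_test))
r2_poly = r2_score(y_test, poly.predict(X_test))
print(f"Linear R²: {r2_lin:.3f}  Degree-2 R²: {r2_poly:.3f}")
```

Keeping the scaler inside the pipeline also means each fit sees only its own training statistics, which matters once you cross-validate.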
🚢 Project 2: Titanic Survival Prediction
Goal
Predict whether a passenger survived the sinking of the Titanic
Dataset
- Training set: 891 rows
- Features: name, sex, age, ticket class, etc.
- Target: 0 (died) / 1 (survived)
Full Code
Python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
print("Training set shape:", train_df.shape)
print(train_df.head())
# 2. Feature engineering
def feature_engineering(df):
    # Work on a copy
    df = df.copy()
    # Extract the title (Mr, Mrs, Miss, ...); the raw string avoids an invalid-escape warning
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
         'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare'
    )
    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    # Family size
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    # Fill missing values (assignment is preferred over inplace=True in modern pandas)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    return df

# Apply feature engineering
train_df = feature_engineering(train_df)
test_df = feature_engineering(test_df)
# 3. Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
            'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone']
X = train_df[features].copy()  # copy so the encodings below don't trigger SettingWithCopyWarning
y = train_df['Survived']
# 4. Encode categorical variables (a fresh LabelEncoder per column avoids mixing mappings)
for col in ['Sex', 'Embarked', 'Title']:
    X[col] = LabelEncoder().fit_transform(X[col])
# 5. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 6. Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
# 7. Train and compare models
models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}
results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    acc = accuracy_score(y_val, y_pred)
    results[name] = acc
    print(f"\n{name}:")
    print(f"Accuracy: {acc:.4f}")
    print(f"Classification report:\n{classification_report(y_val, y_pred)}")
# 8. Pick the best model
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]
print(f"\nBest model: {best_model_name}")
# 9. Confusion matrix
y_pred = best_model.predict(X_val_scaled)
cm = confusion_matrix(y_val, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
# 10. Feature importance (random forest only)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': features,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['Feature'],
             feature_importance['Importance'])
    plt.title('Feature Importance')
    plt.show()
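The import list above brings in cross_val_score but the comparison relies on a single train/validation split. A hedged sketch of how cross-validation could back that comparison up (synthetic data stands in here, since train.csv is an external file; make_classification's parameters are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 10 encoded Titanic features (train.csv is external)
X, y = make_classification(n_samples=891, n_features=10,
                           n_informative=5, random_state=42)

# Scaling inside the pipeline so each CV fold fits its own scaler (no leakage)
clf = make_pipeline(StandardScaler(),
                    RandomForestClassifier(n_estimators=100, random_state=42))
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Averaging over five folds gives a less noisy estimate than one 80/20 split, which matters with only 891 rows.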
🔢 Project 3: Handwritten Digit Recognition (CNN)
Goal
Recognize handwritten digits (0-9)
Dataset
- MNIST: 70,000 grayscale images, 28×28 pixels
Full Code
Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
# 1. Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])
train_dataset = datasets.MNIST('./data', train=True,
                               download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False,
                              download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # DataLoader batches the data, with shuffling and multi-process loading
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)
# 2. Define the CNN
class DigitCNN(nn.Module):
    def __init__(self):  # constructor, called automatically on instantiation
        super().__init__()  # call the parent-class constructor
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Conv layer 1: 1x28x28 → 32x26x26
        x = self.conv1(x)
        x = torch.relu(x)
        # Conv layer 2: 32x26x26 → 64x24x24
        x = self.conv2(x)
        x = torch.relu(x)
        # Pooling: 64x24x24 → 64x12x12, so the flattened size is 9216
        x = torch.max_pool2d(x, 2)
        x = self.dropout1(x)
        # Fully connected layers
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = torch.log_softmax(x, dim=1)
        return output

# 3. Initialize the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DigitCNN().to(device)  # .to(device) moves the model to GPU/CPU
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()
# 4. Training function
def train(model, device, train_loader, optimizer, epoch):
    model.train()  # switch to training mode
    train_loss = 0
    correct = 0
    for batch_idx, (data, target) in enumerate(train_loader):  # enumerate yields index and element together
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # reset gradients so they don't accumulate
        output = model(data)
        loss = criterion(output, target)
        loss.backward()  # backpropagate to compute gradients
        optimizer.step()  # update parameters from the gradients
        train_loss += loss.item()  # .item() converts a one-element tensor to a Python number
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()  # chained method calls
    train_loss /= len(train_loader)
    accuracy = 100. * correct / len(train_loader.dataset)
    print(f'Train Epoch: {epoch} \tLoss: {train_loss:.4f} \tAccuracy: {accuracy:.2f}%')

# 5. Test function
def test(model, device, test_loader):
    model.eval()  # switch to evaluation mode (disables Dropout, etc.)
    test_loss = 0
    correct = 0
    with torch.no_grad():  # disable gradient tracking to save memory
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader)
    accuracy = 100. * correct / len(test_loader.dataset)
    print(f'Test set: Average loss: {test_loss:.4f}, Accuracy: {accuracy:.2f}%')

# 6. Training loop
epochs = 5
for epoch in range(1, epochs + 1):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)
# 7. Visualize predictions
model.eval()
with torch.no_grad():
    data, target = next(iter(test_loader))
    data, target = data.to(device), target.to(device)
    output = model(data)
    predictions = output.argmax(dim=1)
# Show the first 9 images with their predictions
fig, axes = plt.subplots(3, 3, figsize=(10, 10))
for i, ax in enumerate(axes.flat):
    ax.imshow(data[i].cpu().squeeze(), cmap='gray')  # squeeze drops size-1 dimensions
    ax.set_title(f'Pred: {predictions[i].item()}, True: {target[i].item()}')
    ax.axis('off')
plt.tight_layout()
plt.show()
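Not part of the walkthrough above, but a natural next step is persisting the trained weights so the model survives the Python session. A minimal hedged sketch (a tiny stand-in module and the filename 'digit_cnn.pt' are assumptions; in the project this would be the trained DigitCNN):

```python
import torch
import torch.nn as nn

# Tiny stand-in module; in the project this would be the trained DigitCNN
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

# Save only the state dict (recommended over pickling the whole module)
torch.save(model.state_dict(), 'digit_cnn.pt')

# Reload into a freshly constructed model of the same architecture
restored = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
restored.load_state_dict(torch.load('digit_cnn.pt'))
restored.eval()  # evaluation mode before inference

x = torch.randn(1, 1, 28, 28)
print(torch.allclose(model(x), restored(x)))  # identical weights → identical outputs
```

Saving the state dict rather than the module keeps the checkpoint robust to refactors of the class definition.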
📈 Project 4: Customer Segmentation (K-Means)
Python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# 1. Generate simulated data
np.random.seed(42)
n_samples = 500
# Features: age, income, spending score
age = np.random.randint(18, 70, n_samples)
income = np.random.randint(20000, 150000, n_samples)
spending_score = np.random.randint(1, 100, n_samples)
df = pd.DataFrame({
    'Age': age,
    'Annual Income (k$)': income // 1000,
    'Spending Score (1-100)': spending_score
})
print("Data sample:")
print(df.head())
# 2. Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# 3. Choose K with the elbow method
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia (SSE)')
plt.title('Elbow Method')
plt.show()
# 4. Fit K-Means (assuming K=3)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)
df['Cluster'] = clusters
# 5. PCA projection for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('Customer Clusters (PCA projection)')
plt.show()
# 6. Cluster profiles
print("\nPer-cluster statistics:")
for cluster in range(3):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"Samples: {len(cluster_data)}")
    print(f"Mean age: {cluster_data['Age'].mean():.1f}")
    print(f"Mean income: {cluster_data['Annual Income (k$)'].mean():.1f}k")
    print(f"Mean spending score: {cluster_data['Spending Score (1-100)'].mean():.1f}")
# 7. Business interpretation (illustrative labels; match them against the statistics printed above)
print("\nBusiness insights:")
print("Cluster 0: high income, high spending - VIP customers")
print("Cluster 1: medium income, medium spending - regular customers")
print("Cluster 2: low income, low spending - price-sensitive customers")
🔑 Core Skills Checklist
Data Handling
Python
# Pandas operations
df.head()          # first rows
df.info()          # dtypes and missing-value overview
df.describe()      # summary statistics
df.isnull().sum()  # missing-value counts
df.dropna()        # drop rows with missing values
df.fillna(0)       # fill missing values
# NumPy operations
np.mean(arr)     # mean
np.std(arr)      # standard deviation
np.corrcoef(arr) # correlation matrix (pass a 2-D array or two 1-D arrays)
Model Training Workflow
Python
# 1. Data preparation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 2. Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Model training (on the scaled features, to stay consistent with step 2)
model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)
# 4. Prediction
y_pred = model.predict(X_test_scaled)
# 5. Evaluation
from sklearn.metrics import accuracy_score, classification_report
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# 6. Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")
Visualization
Python
# Matplotlib basics
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.scatter(x, y)
plt.hist(data, bins=30)
plt.bar(categories, values)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Chart title')
plt.legend()
plt.show()
📚 Learning Resources
Online Courses
- Andrew Ng - Machine Learning (Coursera)
- Mu Li - Dive into Deep Learning (Bilibili)
- Fast.ai - Practical Deep Learning for Coders
Books
- "Machine Learning in Action" - Peter Harrington
- "Statistical Learning Methods" - Hang Li
- "Deep Learning" - Ian Goodfellow
- "Dive into Deep Learning" - Mu Li
Datasets
- Kaggle: www.kaggle.com/datasets
- UCI: archive.ics.uci.edu/ml/datasets
- Tianchi: tianchi.aliyun.com
Practice Platforms
- Kaggle competitions
- LeetCode
- HackerRank
💡 Learning Advice
Pitfalls to Avoid
- Over-reliance on AutoML: calling libraries without understanding the underlying principles
- Data leakage: test-set information bleeding into the training set
- Overfitting: good performance on the training set, poor on the test set
- Ignoring baselines: never establishing a simple baseline model
- Blind hyperparameter tuning: grid-searching parameters without understanding what they mean
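To make the data-leakage pitfall above concrete, here is a minimal sketch on synthetic data (the contrast, not the exact numbers, is the point): fitting the scaler on the full dataset before splitting bakes test statistics into the preprocessing, while fitting it on the training split alone does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler has seen the test rows, so its mean/std encode test information
leaky = StandardScaler().fit(X)        # WRONG: fit on all data
# Correct: statistics come from the training split only
clean = StandardScaler().fit(X_train)

print("Leaky scaler mean:", np.round(leaky.mean_, 3))
print("Clean scaler mean:", np.round(clean.mean_, 3))
# The fitted statistics differ; with real models this difference inflates test scores
print("Means differ:", not np.allclose(leaky.mean_, clean.mean_))
```

The same rule applies to imputation, encoding, and feature selection: every step that learns from data belongs after the split (or inside a pipeline under cross-validation).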
Recommended Learning Path
Text Only
Week 1: Python basics + NumPy/Pandas
Week 2: Data visualization + EDA
Weeks 3-4: Linear regression + logistic regression
Weeks 5-6: Decision trees + random forests
Weeks 7-8: SVM + naive Bayes
Weeks 9-10: K-Means + PCA
Weeks 11-12: CNN basics
Every weekend: finish one hands-on project!
🚀 Next Steps
- Pick a project and start practicing
- Implement core algorithms from scratch
- Enter a Kaggle competition
- Read classic papers
- Build your own project portfolio
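As a starting point for "implement core algorithms from scratch", here is a hedged NumPy sketch of linear regression trained with gradient descent (the learning rate and iteration count are arbitrary assumptions, not tuned values):

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.1, n_iters=500):
    """Fit weights w and intercept b by minimizing MSE with gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y            # prediction error
        w -= lr * (2 / n) * X.T @ residual  # gradient of MSE w.r.t. w
        b -= lr * (2 / n) * residual.sum()  # gradient of MSE w.r.t. b
    return w, b

# Recover known coefficients from noiseless synthetic data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.5]) + 0.5
w, b = linear_regression_gd(X, y)
print(np.round(w, 2), round(b, 2))  # approximately [3.0, -1.5] and 0.5
```

Comparing the recovered coefficients against sklearn's LinearRegression on the same data is a good sanity check for the from-scratch version.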
Remember: theory + practice + reflection = mastery
Good luck! 🎉