
06 - Machine Learning Practice Guide

Machine learning practice workflow (diagram)

🎯 Practice Projects Overview

Project List (increasing difficulty)

Text Only
Beginner:
1. California housing price prediction (linear regression)
2. Iris classification (logistic regression / SVM)

Intermediate:
3. Titanic survival prediction (random forest)
4. Handwritten digit recognition (CNN)
5. Customer segmentation (K-Means)

Advanced:
6. Stock price prediction (LSTM)
7. Image classification (transfer learning)
8. News classification (NLP)

📊 Project 1: California Housing Price Prediction

Goal

Predict the median house value for California districts

Dataset

  • Samples: 20,640
  • Features: 8 (median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude)
  • Target: median house value (in units of $100,000)

Important note: the previously used Boston Housing dataset was removed from scikit-learn because it contains a racially derived feature. We use the California Housing dataset instead, which meets ethical standards and is equally well suited to regression.

Complete Code

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load the data
california = fetch_california_housing()
X = california.data
y = california.target
feature_names = california.feature_names

print(f"Data shape: {X.shape}")
print(f"Feature names: {feature_names}")

# 2. Exploratory data analysis (EDA)
df = pd.DataFrame(X, columns=feature_names)
df['Price'] = y

# Inspect the data
print(df.head())

# Correlation analysis
corr = df.corr()
print("Correlation with price:")
print(corr['Price'].sort_values(ascending=False))

# Visualize: average rooms vs price
plt.scatter(df['AveRooms'], df['Price'])
plt.xlabel('Average rooms (AveRooms)')
plt.ylabel('Price')
plt.title('Average Rooms vs Price')
plt.show()

# 3. Preprocessing
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Model training
# Linear regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Ridge regression (L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Lasso regression (L1 regularization)
lasso = Lasso(alpha=0.01)  # alpha=1.0 would be too strong and shrink most coefficients to zero
lasso.fit(X_train_scaled, y_train)

# 5. Prediction and evaluation
def evaluate_model(model, X_test, y_test, name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"\n{name}:")
    print(f"MSE: {mse:.2f}")
    print(f"R²: {r2:.4f}")
    return y_pred

y_pred_lr = evaluate_model(lr, X_test_scaled, y_test, "Linear regression")
y_pred_ridge = evaluate_model(ridge, X_test_scaled, y_test, "Ridge regression")
y_pred_lasso = evaluate_model(lasso, X_test_scaled, y_test, "Lasso regression")

# 6. Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lr, alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual price ($100k)')
plt.ylabel('Predicted price ($100k)')
plt.title('Actual vs Predicted Prices')
plt.show()

# 7. Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': lr.coef_
}).sort_values('Coefficient', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'],
         feature_importance['Coefficient'])
plt.title('Feature Importance (regression coefficients)')
plt.show()

Exercises

  1. Try different alpha values (regularization strength)
  2. Add polynomial features (to capture non-linear relationships)
  3. Try other algorithms (random forest, XGBoost)
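For exercise 2, scikit-learn's PolynomialFeatures can be combined with LinearRegression in a pipeline. A minimal sketch on toy data with a known quadratic relationship (the data and degree are illustrative choices, not part of the housing project):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with a quadratic relationship: y = x^2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# Plain linear regression cannot capture the curvature...
linear = LinearRegression().fit(X, y)
# ...but adding squared terms makes the relationship linear in the expanded features
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f"Linear R²: {linear.score(X, y):.3f}")    # near 0
print(f"Degree-2 R²: {poly.score(X, y):.3f}")    # near 1
```

The same pipeline applied to the housing features would add all pairwise interaction and squared terms, so watch the feature count grow with the degree.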

🚢 Project 2: Titanic Survival Prediction

Goal

Predict whether a passenger survived the sinking of the Titanic

Dataset

  • Training set: 891 rows
  • Features: name, sex, age, ticket class, etc.
  • Target: 0 (did not survive) / 1 (survived)

Complete Code

Python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the data (Kaggle Titanic CSV files)
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Training set shape:", train_df.shape)
print(train_df.head())

# 2. Feature engineering
def feature_engineering(df):
    # Work on a copy
    df = df.copy()

    # Extract the title (Mr, Mrs, Miss, ...) from the name; raw string avoids an invalid-escape warning
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
         'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare'
    )
    df['Title'] = df['Title'].replace('Mlle', 'Miss')
    df['Title'] = df['Title'].replace('Ms', 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')

    # Family size
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

    # Fill missing values (assignment instead of inplace=True, as modern pandas recommends)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())

    return df

# Apply the feature engineering
train_df = feature_engineering(train_df)
test_df = feature_engineering(test_df)

# 3. Feature selection
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
            'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone']

X = train_df[features].copy()  # copy, so encoding below does not trigger SettingWithCopyWarning
y = train_df['Survived']

# 4. Encode categorical variables (a fresh LabelEncoder per column; sharing one encoder would mix the mappings)
for col in ['Sex', 'Embarked', 'Title']:
    X[col] = LabelEncoder().fit_transform(X[col])

# 5. Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 6. Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# 7. Train and compare models
models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    acc = accuracy_score(y_val, y_pred)
    results[name] = acc
    print(f"\n{name}:")
    print(f"Accuracy: {acc:.4f}")
    print(f"Classification report:\n{classification_report(y_val, y_pred)}")

# 8. Pick the best model
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]
print(f"\nBest model: {best_model_name}")

# 9. Confusion matrix
y_pred = best_model.predict(X_val_scaled)
cm = confusion_matrix(y_val, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

# 10. Feature importance (random forest only)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': features,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)

    plt.figure(figsize=(10, 6))
    plt.barh(feature_importance['Feature'],
             feature_importance['Importance'])
    plt.title('Feature Importance')
    plt.show()
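The single train/validation split above can be noisy on 891 rows; `cross_val_score` (imported at the top of this project but otherwise unused) averages over several splits. A minimal sketch on synthetic data, since the Titanic CSVs may not be at hand (the generated dataset is a stand-in for the encoded features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded passenger features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Comparing models by their cross-validated mean, rather than a single validation accuracy, makes the "best model" choice in step 8 much more reliable.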

🔢 Project 3: Handwritten Digit Recognition (CNN)

Goal

Recognize handwritten digits (0-9)

Dataset

  • MNIST: 70,000 grayscale images of 28×28 pixels

Complete Code

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# 1. Data preparation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and standard deviation
])

train_dataset = datasets.MNIST('./data', train=True,
                               download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False,
                              download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # DataLoader batches the data; supports shuffling and worker processes
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# 2. Define the CNN model
class DigitCNN(nn.Module):
    def __init__(self):  # constructor, called automatically when the object is created
        super().__init__()  # initialize the parent nn.Module
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Conv layer 1: 1x28x28 → 32x26x26
        x = self.conv1(x)
        x = torch.relu(x)

        # Conv layer 2: 32x26x26 → 64x24x24
        x = self.conv2(x)
        x = torch.relu(x)
        # Max pooling: 64x24x24 → 64x12x12, which flattens to 9216
        x = torch.max_pool2d(x, 2)
        x = self.dropout1(x)

        # Fully connected layers
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = torch.log_softmax(x, dim=1)
        return output

# 3. Initialize the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DigitCNN().to(device)  # .to(device) moves the model to GPU/CPU
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()  # negative log-likelihood, paired with log_softmax

# 4. Training function
def train(model, device, train_loader, optimizer, epoch):
    model.train()  # switch to training mode
    train_loss = 0
    correct = 0

    for batch_idx, (data, target) in enumerate(train_loader):  # enumerate yields index and batch together
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # zero the gradients so they do not accumulate
        output = model(data)
        loss = criterion(output, target)
        loss.backward()  # backpropagate to compute gradients
        optimizer.step()  # update the parameters from the gradients

        train_loss += loss.item()  # .item() converts a one-element tensor to a Python number
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()  # chained calls, executed left to right

    train_loss /= len(train_loader)
    accuracy = 100. * correct / len(train_loader.dataset)
    print(f'Train Epoch: {epoch} \tLoss: {train_loss:.4f} \tAccuracy: {accuracy:.2f}%')

# 5. Test function
def test(model, device, test_loader):
    model.eval()  # switch to evaluation mode (disables Dropout, etc.)
    test_loss = 0
    correct = 0

    with torch.no_grad():  # disable gradient tracking to save memory
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader)
    accuracy = 100. * correct / len(test_loader.dataset)
    print(f'Test set: Average loss: {test_loss:.4f}, Accuracy: {accuracy:.2f}%')

# 6. Training loop
epochs = 5
for epoch in range(1, epochs + 1):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)

# 7. Visualize predictions
model.eval()
with torch.no_grad():
    data, target = next(iter(test_loader))
    data, target = data.to(device), target.to(device)
    output = model(data)
    predictions = output.argmax(dim=1)

    # Show the first 9 images with their predictions
    fig, axes = plt.subplots(3, 3, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].cpu().squeeze(), cmap='gray')  # squeeze drops size-1 dimensions
        ax.set_title(f'Predicted: {predictions[i].item()}, Actual: {target[i].item()}')
        ax.axis('off')
    plt.tight_layout()
    plt.show()
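The shape comments in `forward` (28 → 26 → 24 → 12, flattened to 9216) can be checked with plain arithmetic before any training. A minimal sketch of the convolution output-size formula (the helper name `conv2d_out` is made up for illustration):

```python
def conv2d_out(size, kernel=3, stride=1, padding=0):
    """Output spatial size of a conv layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                      # MNIST input: 1x28x28
size = conv2d_out(size)        # conv1 (3x3, stride 1) -> 26
size = conv2d_out(size)        # conv2 (3x3, stride 1) -> 24
size = size // 2               # 2x2 max pooling       -> 12
flattened = 64 * size * size   # 64 channels -> 9216, matching nn.Linear(9216, 128)
print(flattened)               # 9216
```

Running a dummy tensor through the model (`model(torch.zeros(1, 1, 28, 28))`) is the equivalent check in PyTorch itself; a mismatch here is the usual cause of size errors at the first Linear layer.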

📈 Project 4: Customer Segmentation (K-Means)

Python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 1. Generate synthetic data
np.random.seed(42)
n_samples = 500

# Features: age, income, spending score
age = np.random.randint(18, 70, n_samples)
income = np.random.randint(20000, 150000, n_samples)
spending_score = np.random.randint(1, 100, n_samples)

df = pd.DataFrame({
    'Age': age,
    'Annual Income (k$)': income // 1000,
    'Spending Score (1-100)': spending_score
})

print("Sample data:")
print(df.head())

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# 3. Choose K with the elbow method
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia (SSE)')
plt.title('Elbow Method')
plt.show()

# 4. Fit K-Means (assume K=3)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

df['Cluster'] = clusters

# 5. Visualize with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('Customer Clusters (PCA projection)')
plt.show()

# 6. Cluster profiles
print("\nCluster profiles:")
for cluster in range(3):
    cluster_data = df[df['Cluster'] == cluster]
    print(f"\nCluster {cluster}:")
    print(f"Size: {len(cluster_data)}")
    print(f"Mean age: {cluster_data['Age'].mean():.1f}")
    print(f"Mean income: {cluster_data['Annual Income (k$)'].mean():.1f}k")
    print(f"Mean spending score: {cluster_data['Spending Score (1-100)'].mean():.1f}")

# 7. Business interpretation (illustrative labels; check them against the printed profiles,
# since this synthetic uniform data carries no built-in customer segments)
print("\nBusiness insights:")
print("Cluster 0: high income, high spending - VIP customers")
print("Cluster 1: medium income, medium spending - regular customers")
print("Cluster 2: low income, low spending - price-sensitive customers")
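On uniform random data like the above, the elbow curve is often ambiguous. The silhouette score is a complementary criterion (higher is better, range -1 to 1). A minimal sketch on synthetic blobs with a known ground truth of three clusters (the blob data is illustrative, not the customer data above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs, so the right answer is K=3
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

In practice, agreement between the elbow and the silhouette peak gives much more confidence in the chosen K than either alone.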

🔑 Core Skills Checklist

Data Handling

Python
# Pandas operations
df.head()          # first few rows
df.info()          # dtypes and non-null counts
df.describe()      # summary statistics
df.isnull().sum()  # missing values per column
df.dropna()        # drop rows with missing values
df.fillna(0)       # fill missing values

# NumPy operations
np.mean(arr)       # mean
np.std(arr)        # standard deviation
np.corrcoef(arr)   # correlation matrix (rows are treated as variables)
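A tiny runnable demonstration of these calls on a DataFrame with one missing value (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 30, np.nan, 40],
                   'score': [80, 90, 85, 95]})

print(df.isnull().sum())             # 'age' has 1 missing value
print(df.dropna().shape)             # (3, 2) after dropping the NaN row
print(df.fillna(0)['age'].tolist())  # NaN replaced by 0

arr = df['score'].to_numpy()
print(np.mean(arr), np.std(arr))     # 87.5 and ~5.59
```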

Model Training Workflow

Python
# 0. Imports (everything this template needs)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Data preparation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Model training (tree ensembles do not need scaling;
# pass X_train_scaled instead for scale-sensitive models such as SVM)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# 4. Prediction
y_pred = model.predict(X_test)

# 5. Evaluation
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# 6. Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

Visualization

Python
# Matplotlib basics
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.scatter(x, y)
plt.hist(data, bins=30)
plt.bar(categories, values)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Chart title')
plt.legend()
plt.show()

📚 Learning Resources

Online Courses

  • Andrew Ng - Machine Learning (Coursera)
  • Mu Li - Dive into Deep Learning (Bilibili)
  • Fast.ai - Practical Deep Learning for Coders

Books

  • "Machine Learning in Action" - Peter Harrington
  • "Statistical Learning Methods" - Hang Li
  • "Deep Learning" - Ian Goodfellow
  • "Dive into Deep Learning" - Mu Li

Datasets & Practice Platforms

  • Kaggle competitions
  • LeetCode
  • HackerRank

💡 Study Tips

Pitfalls to Avoid

  1. Over-relying on AutoML: calling libraries without understanding the underlying principles
  2. Data leakage: letting test-set information slip into training
  3. Overfitting: strong performance on the training set, poor on the test set
  4. Ignoring baselines: never establishing a simple baseline model
  5. Blind tuning: grid-searching hyperparameters without understanding what they mean
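Pitfall 2 most often comes from fitting a scaler on the full dataset before splitting, so the "test" statistics leak into preprocessing. A sketch of the wrong and right orderings (random data for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))

# WRONG: the scaler sees test-set statistics before the split -> leakage
X_leaky = StandardScaler().fit_transform(X)
X_train_bad, X_test_bad = train_test_split(X_leaky, test_size=0.2, random_state=42)

# RIGHT: split first, then fit the scaler on the training portion only
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)  # statistics from training data only
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)    # test set is transformed, never fitted
```

Wrapping the scaler and model in a scikit-learn Pipeline enforces this ordering automatically, including inside cross-validation.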

Recommended Learning Path

Text Only
Week 1: Python basics + NumPy/Pandas
Week 2: Data visualization + EDA
Weeks 3-4: Linear regression + logistic regression
Weeks 5-6: Decision trees + random forests
Weeks 7-8: SVM + naive Bayes
Weeks 9-10: K-Means + PCA
Weeks 11-12: CNN basics

Every weekend: complete one hands-on project!

Daily Practice

Python
# Today: implement linear regression from scratch
# Tomorrow: implement it with scikit-learn
# The day after: try it on a different dataset
# Then: read the library source code

Practice consistently; don't just read without coding!
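For the "from scratch" exercise, a minimal sketch using the closed-form normal equation (gradient descent would work equally well; the function name is made up for illustration):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares: solve min ||Xb w - y||^2, i.e. w = (X^T X)^-1 X^T y."""
    Xb = np.c_[np.ones(len(X)), X]             # prepend a bias column of ones
    w = np.linalg.lstsq(Xb, y, rcond=None)[0]  # lstsq is more stable than an explicit inverse
    return w[0], w[1:]                         # intercept, coefficients

# Sanity check on data with a known relationship: y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1
intercept, coef = fit_linear_regression(X, y)
print(round(intercept, 6), round(coef[0], 6))  # 1.0 2.0
```

Comparing these numbers against `sklearn.linear_model.LinearRegression` on the same data is a good "tomorrow" exercise; they should match to floating-point precision.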

🚀 Next Steps

  1. Pick a project and start building
  2. Implement the core algorithms from scratch
  3. Enter a Kaggle competition
  4. Read the classic papers
  5. Build your own project portfolio

Remember: theory + practice + reflection = mastery

Happy learning! 🎉