06 - Regularization Techniques

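Study time: about 6-8 hours
Difficulty: ⭐⭐⭐ Intermediate
Prerequisites: neural network basics, backpropagation, loss functions and optimization
Learning objectives: understand the principles and implementation of the main regularization techniques and learn practical ways to prevent overfitting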

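Contents

1. Overfitting Revisited
2. L1 and L2 Regularization
3. Dropout in Detail
4. Batch Normalization
5. Layer Normalization
6. Weight Initialization Strategies
7. Data Augmentation
8. Early Stopping
9. Gradient Clipping
10. Exercises and Self-Check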

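1. Overfitting Revisited

1.1 What Is Overfitting

Overfitting is one of the most common problems in machine learning. A model is overfitting when it performs very well on the training set but markedly worse on the test set.

From the perspective of the bias-variance decomposition:

\[E[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2\]

- Bias: how far the model's average prediction is from the true value
- Variance: how much the prediction changes across different training sets
- Irreducible error: the noise in the data itself ($\sigma^2$)

Overfitting means the model's variance is too high: it has "memorized" the noise in the training data instead of learning the underlying pattern.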

1.2 Signs of Overfitting

Python

import matplotlib.pyplot as plt

def plot_overfitting(train_losses, val_losses):
    """Plot training and validation loss per epoch to reveal overfitting."""
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label='Training loss', color='blue')
    plt.plot(val_losses, label='Validation loss', color='red')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Overfitting illustration')
    plt.legend()
    plt.grid(True)
    plt.show()

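Typical signs:

- Training loss keeps falling while validation loss falls and then rises
- Training accuracy approaches 100% while validation accuracy stays far below it

A parameter count far larger than the number of training samples is a risk factor rather than a symptom: it makes overfitting more likely, but it does not by itself show that overfitting has occurred.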

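1.3 Overview of Regularization

Regularization is a family of techniques that reduce model complexity and prevent overfitting. The core idea is to add a penalty on model complexity to the optimization objective.

\[\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \cdot \Omega(\theta)\]

Here $\Omega(\theta)$ is the regularization term on the parameters $\theta$, and $\lambda$ controls the regularization strength.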

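2. L1 and L2 Regularization

2.1 L2 Regularization (Weight Decay)

L2 regularization adds a penalty on the sum of squared parameters to the loss:

\[\mathcal{L}_{\text{L2}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_{i} w_i^2\]

Gradient and update: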

\[\frac{\partial \mathcal{L}_{\text{L2}}}{\partial w_i} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + \lambda w_i\]

\[w_i \leftarrow w_i - \eta \left(\frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + \lambda w_i \right) = (1 - \eta\lambda)w_i - \eta \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i}\]

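Since $\eta\lambda > 0$, the factor $(1 - \eta\lambda)$ is less than 1, so every update shrinks the weights slightly before the data-gradient step is applied. This is where the name "weight decay" comes from.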
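The equivalence of the two forms of the update can be checked with a few lines of arithmetic. The sketch below uses a made-up one-dimensional data loss $\frac{1}{2}(w-3)^2$ and arbitrary values for $\eta$ and $\lambda$; these are illustrative assumptions, not values from the text.

```python
# Toy check that the L2-penalised SGD step equals "decay, then plain step".
# The data loss 0.5 * (w - 3)^2 and the constants are illustrative assumptions.

def data_grad(w):
    """Gradient of the toy data loss 0.5 * (w - 3)^2."""
    return w - 3.0

eta, lam = 0.1, 0.01
w = 2.0

# Form 1: gradient of the data loss plus gradient of (lam / 2) * w^2
w_penalty = w - eta * (data_grad(w) + lam * w)

# Form 2: shrink the weight by (1 - eta * lam), then take the plain step
w_decay = (1 - eta * lam) * w - eta * data_grad(w)

print(w_penalty, w_decay)
```

Both forms give the same new weight, which is why plain SGD can implement the L2 penalty as a multiplicative decay.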

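Geometric intuition for L2: L2 regularization tends to shrink all weights evenly and discourages any single weight from becoming very large. In a contour plot, the L2 constraint region is a circle (a hypersphere in higher dimensions), and the optimum is the point where a contour of the loss function is tangent to that circle.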

Python

import torch
import torch.nn as nn

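# Method 1: use the optimizer's weight_decay argument
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# With plain SGD, weight_decay is the L2 coefficient λ (it adds λ·w to each gradient).
# With adaptive optimizers such as Adam, an L2 penalty and decoupled weight decay
# (AdamW) are no longer equivalent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Method 2: add the L2 penalty to the loss by hand
def l2_regularization(model, lambda_l2=1e-4):
    """Return (lambda_l2 / 2) * sum of squared parameters, matching the formula above."""
    # Summing tensors keeps the result on the same device as the parameters
    l2_reg = sum(param.pow(2).sum() for param in model.parameters())
    return 0.5 * lambda_l2 * l2_reg

# Use in the training loop. The optimizer here has no weight_decay,
# otherwise the penalty would be applied twice.
# num_epochs and dataloader are placeholders defined elsewhere.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, targets) + l2_regularization(model)
        optimizer.zero_grad()  # clear gradients so they do not accumulate
        loss.backward()  # backpropagate to compute gradients
        optimizer.step()  # update the parameters from the gradients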

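2.2 L1 Regularization

L1 regularization adds the sum of the absolute values of the parameters to the loss:

\[\mathcal{L}_{\text{L1}} = \mathcal{L}_{\text{data}} + \lambda \sum_{i} |w_i|\]

Gradient (a subgradient at $w_i = 0$):

\[\frac{\partial \mathcal{L}_{\text{L1}}}{\partial w_i} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial w_i} + \lambda \cdot \text{sign}(w_i)\]

The defining property of L1 is sparsity: it pushes some weights to exactly zero and so produces a sparse model. The L1 constraint region is a diamond (a cross-polytope in higher dimensions, not a hypercube), and the loss contours are more likely to touch it at a vertex, where some coordinates are zero.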
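The sparsity effect shows up in the simplest possible setting. The sketch below assumes a one-dimensional toy objective $\frac{1}{2}(w-a)^2$ plus a penalty, for which both minimizers have closed forms; the objective and the sample values of $a$ are illustrative assumptions.

```python
# Closed-form minimisers of 0.5 * (w - a)^2 + penalty(w) in one dimension.
# This toy objective is an illustrative assumption, not a training loop.

def l1_solution(a, lam):
    """argmin_w 0.5 * (w - a)^2 + lam * |w|  (soft thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # any |a| <= lam is mapped to exactly zero

def l2_solution(a, lam):
    """argmin_w 0.5 * (w - a)^2 + (lam / 2) * w^2  (uniform shrinkage)."""
    return a / (1.0 + lam)

lam = 0.5
for a in [2.0, 0.3, -0.1]:
    print(a, l1_solution(a, lam), l2_solution(a, lam))
```

With L1, every unregularized optimum $a$ inside $[-\lambda, \lambda]$ collapses to exactly zero, whereas L2 only scales $a$ down and never reaches zero unless $a$ is already zero.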

Python

def l1_regularization(model, lambda_l1=1e-5):
    """Return lambda_l1 * sum of absolute parameter values."""
    # Summing tensors keeps the result on the same device as the parameters
    l1_reg = sum(param.abs().sum() for param in model.parameters())
    return lambda_l1 * l1_reg

# Elastic Net: L1 and L2 combined
def elastic_net_regularization(model, lambda_l1=1e-5, lambda_l2=1e-4):
    """Elastic Net penalty: lambda_l1 * sum|w| + lambda_l2 * sum w^2."""
    l1_reg = sum(param.abs().sum() for param in model.parameters())
    l2_reg = sum(param.pow(2).sum() for param in model.parameters())
    return lambda_l1 * l1_reg + lambda_l2 * l2_reg

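2.3 L1 vs L2 Comparison

Property	L1 regularization	L2 regularization
Penalty term	$\lambda\sum\lvert w_i\rvert$	$\frac{\lambda}{2}\sum w_i^2$
Constraint shape	Diamond	Circle
Sparsity	Produces sparse weights	Weights approach zero but do not reach it
Feature selection	Implicit feature selection	No feature selection
Differentiability	Not differentiable at $w=0$	Differentiable everywhere
Typical use	Many features and sparsity is wanted	Most deep learning settings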

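3. Dropout in Detail

3.1 How Dropout Works

Dropout (Srivastava et al., 2014) is one of the most widely used regularization techniques in deep learning. The core idea is very simple: during training, randomly set the outputs of a fraction of the neurons to zero.

For the output $\mathbf{h}$ of a layer, Dropout works as follows.

Training:

\[r_j \sim \text{Bernoulli}(p), \qquad \tilde{\mathbf{h}} = \mathbf{r} \odot \mathbf{h}\]

Here $p$ is the keep probability, $\mathbf{r}$ is a random binary vector with the same shape as $\mathbf{h}$, and $\odot$ denotes element-wise multiplication.

Testing: no units are dropped, but the output must be scaled so that its expected value matches training:

\[\tilde{\mathbf{h}}_{\text{test}} = p \cdot \mathbf{h}\]

3.2 Inverted Dropout

Practical implementations usually use Inverted Dropout, which applies the scaling during training instead.

Training:

\[\tilde{\mathbf{h}} = \frac{1}{p} \cdot \mathbf{r} \odot \mathbf{h}\]

Testing (no change):

\[\tilde{\mathbf{h}}_{\text{test}} = \mathbf{h}\]

The test-time code then needs no modification, which is more convenient.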
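The reason the $\frac{1}{p}$ factor works is that it keeps the expected activation unchanged. A minimal simulation in plain Python, assuming an all-ones activation vector and a keep probability of 0.8 chosen only for illustration:

```python
import random

def inverted_dropout(h, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob; rescale survivors by 1 / keep_prob."""
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in h]

rng = random.Random(0)          # fixed seed so the run is repeatable
h = [1.0] * 100_000             # illustrative activations, all equal to 1
keep_prob = 0.8

out = inverted_dropout(h, keep_prob, rng)
mean_out = sum(out) / len(out)
kept_fraction = sum(1 for v in out if v != 0.0) / len(out)
print(f"mean after dropout: {mean_out:.3f}, kept fraction: {kept_fraction:.3f}")
```

About 80% of the units survive, each scaled to 1.25, so the mean stays close to the original value of 1 and no rescaling is needed at test time.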

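3.3 Why Dropout Works

- Approximate ensembling: each Dropout mask trains a different sub-network. A layer with $n$ neurons has $2^n$ possible sub-networks, and the final prediction behaves like an ensemble over them.
- Breaking co-adaptation: it prevents neurons from developing strong dependencies on each other and forces each one to learn more robust features.
- Noise injection: it acts like noise added to the model, which improves generalization.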

3.4 PyTorch Implementation

Python

import torch
import torch.nn as nn
import torch.nn.functional as F

# ===== Manual Dropout implementation =====
class ManualDropout(nn.Module):
    def __init__(self, p=0.5):
        """
        Args:
            p: drop probability (note: PyTorch's Dropout argument p is the
               drop probability, not the keep probability)
        """
        super().__init__()
        self.p = p

    def forward(self, x):
        if self.training and self.p > 0:
            # Bernoulli mask with keep probability 1 - p
            keep_prob = 1 - self.p
            mask = torch.bernoulli(torch.full_like(x, keep_prob))
            # Inverted Dropout: rescale during training
            return x * mask / keep_prob
        return x

# ===== Using PyTorch's built-in Dropout =====
class DropoutNet(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, num_classes=10, dropout_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.fc3 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

# Usage example
model = DropoutNet()

# Training: Dropout is active
model.train()
output_train = model(torch.randn(32, 784))

# Evaluation: Dropout is disabled
model.eval()
output_eval = model(torch.randn(32, 784))

# ===== 2D Dropout (for CNNs) =====
class CNNWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        # Dropout2d drops entire feature-map channels
        self.dropout2d = nn.Dropout2d(p=0.25)
        # 64 * 8 * 8 assumes 32x32 inputs (two 2x2 poolings: 32 -> 16 -> 8)
        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout2d(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout2d(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

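3.5 Dropout Variants

Variant	Description
Spatial Dropout	Drops entire feature-map channels; used in CNNs
DropConnect	Drops weights instead of activations
DropBlock	Drops contiguous regions of a feature map
Variational Dropout	Keeps the same Dropout mask across time steps
Concrete Dropout	Learns the Dropout rate automatically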

4. Batch Normalization

4.1 Internal Covariate Shift

Batch Normalization (BN) was proposed by Ioffe & Szegedy (2015) to address Internal Covariate Shift: as training progresses, the input distribution of each layer keeps changing, which makes training difficult.

Note: later work (Santurkar et al., 2018) suggests that BN's success may owe more to smoothing the loss landscape than to reducing Internal Covariate Shift.

4.2 BN Forward Pass

Given a mini-batch $\mathcal{B} = \{x_1, x_2, \ldots, x_m\}$:

Step 1: compute the mini-batch mean

\[\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i\]

Step 2: compute the mini-batch variance

\[\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2\]

Step 3: normalize

\[\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}\]

Step 4: scale and shift

\[y_i = \gamma \hat{x}_i + \beta\]

Here $\epsilon$ is a small constant for numerical stability (for example $10^{-5}$), and $\gamma$ and $\beta$ are learnable parameters.
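The four steps can be traced by hand on a tiny batch. The sketch below is plain Python for a single feature; the batch values and the defaults $\gamma = 1$, $\beta = 0$ are illustrative assumptions.

```python
import math

def batch_norm_forward(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply steps 1-4 to a list of scalars (one feature across a mini-batch)."""
    m = len(xs)
    mu = sum(xs) / m                                        # step 1: batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                # step 2: biased batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # step 3: normalize
    y = [gamma * v + beta for v in x_hat]                   # step 4: scale and shift
    return mu, var, y

mu, var, y = batch_norm_forward([1.0, 2.0, 3.0, 4.0])
print(mu, var, y)
```

For this batch the mean is 2.5 and the variance is 1.25; the normalized outputs have zero mean and a variance just below 1 because of $\epsilon$.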

4.3 BN Backward Pass Derivation

Let $\mathcal{L}$ be the loss. Given $\frac{\partial \mathcal{L}}{\partial y_i}$, we need $\frac{\partial \mathcal{L}}{\partial x_i}$, $\frac{\partial \mathcal{L}}{\partial \gamma}$ and $\frac{\partial \mathcal{L}}{\partial \beta}$.

Gradients with respect to $\gamma$ and $\beta$:

\[\frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial y_i} \cdot \hat{x}_i\]

\[\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial y_i}\]

Gradient with respect to $\hat{x}_i$:

\[\frac{\partial \mathcal{L}}{\partial \hat{x}_i} = \frac{\partial \mathcal{L}}{\partial y_i} \cdot \gamma\]

Gradient with respect to $\sigma_{\mathcal{B}}^2$:

\[\frac{\partial \mathcal{L}}{\partial \sigma_{\mathcal{B}}^2} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot (x_i - \mu_{\mathcal{B}}) \cdot \left(-\frac{1}{2}\right) (\sigma_{\mathcal{B}}^2 + \epsilon)^{-3/2}\]

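Gradient with respect to $\mu_{\mathcal{B}}$ (the second term vanishes because $\sum_{i}(x_i - \mu_{\mathcal{B}}) = 0$, but it is kept here for completeness):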

\[\frac{\partial \mathcal{L}}{\partial \mu_{\mathcal{B}}} = \sum_{i=1}^{m} \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \mathcal{L}}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{-2}{m} \sum_{i=1}^{m}(x_i - \mu_{\mathcal{B}})\]

Gradient with respect to $x_i$:

\[\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\partial \mathcal{L}}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \frac{\partial \mathcal{L}}{\partial \sigma_{\mathcal{B}}^2} \cdot \frac{2(x_i - \mu_{\mathcal{B}})}{m} + \frac{\partial \mathcal{L}}{\partial \mu_{\mathcal{B}}} \cdot \frac{1}{m}\]
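One way to gain confidence in these formulas is to code them directly and compare against automatic differentiation. The sketch below assumes PyTorch is available; the function name `bn_backward`, the tensor shapes and the random seed are choices made for this illustration.

```python
import torch

def bn_backward(dy, x, gamma, eps=1e-5):
    """Gradients of BN w.r.t. x, gamma, beta, following the formulas above term by term."""
    m = x.shape[0]
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    inv_std = 1.0 / torch.sqrt(var + eps)
    x_hat = (x - mu) * inv_std

    dgamma = (dy * x_hat).sum(dim=0)
    dbeta = dy.sum(dim=0)
    dx_hat = dy * gamma
    dvar = (dx_hat * (x - mu)).sum(dim=0) * (-0.5) * inv_std ** 3
    dmu = (dx_hat * -inv_std).sum(dim=0) + dvar * (-2.0 / m) * (x - mu).sum(dim=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta

torch.manual_seed(0)
eps = 1e-5
x = torch.randn(8, 4, dtype=torch.float64, requires_grad=True)
gamma = torch.randn(4, dtype=torch.float64, requires_grad=True)
beta = torch.randn(4, dtype=torch.float64, requires_grad=True)
dy = torch.randn(8, 4, dtype=torch.float64)   # stands in for dL/dy

# Reference gradients from autograd
x_hat = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + eps)
y = gamma * x_hat + beta
y.backward(dy)

dx, dgamma, dbeta = bn_backward(dy, x.detach(), gamma.detach(), eps)
print("max |dx - autograd|:", (dx - x.grad).abs().max().item())
```

Double precision is used so that the hand-derived and autograd gradients can be compared with a tight tolerance.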

4.4 Training vs Inference

- Training: uses the mean and variance of the current mini-batch
- Inference: uses the moving averages accumulated during training (running mean / running variance)

\[\mu_{\text{running}} = (1-\alpha) \cdot \mu_{\text{running}} + \alpha \cdot \mu_{\mathcal{B}}\]

\[\sigma^2_{\text{running}} = (1-\alpha) \cdot \sigma^2_{\text{running}} + \alpha \cdot \sigma^2_{\mathcal{B}}\]

Here $\alpha$ is the momentum coefficient (0.1 by default in PyTorch).
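To get a feel for how quickly the running statistics track the batch statistics, the update can be iterated on a constant input. The sketch below assumes every mini-batch has the same mean of 5.0, an artificial value picked only to make the convergence visible.

```python
momentum = 0.1        # alpha in the formula above
running_mean = 0.0    # running mean starts at zero
batch_mean = 5.0      # illustrative assumption: every mini-batch has this mean

history = []
for step in range(50):
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    history.append(running_mean)

print(history[0], history[9], history[-1])
```

After $k$ updates the running mean equals $5\,(1 - 0.9^k)$, so with $\alpha = 0.1$ it takes a few dozen batches before the estimate is close to the true value. Evaluating a model after only a handful of training steps can therefore give misleading results.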

4.5 PyTorch Implementation

Python

import torch
import torch.nn as nn

# ===== Manual Batch Normalization implementation =====
class ManualBatchNorm1d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum

        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

        # Moving-average statistics (buffers, excluded from gradient computation)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Training mode: use batch statistics
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)

            # Update the moving averages.
            # Note: nn.BatchNorm1d updates running_var with the unbiased batch
            # variance; this sketch uses the biased one to match the formula above.
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference mode: use the moving averages
            mean = self.running_mean
            var = self.running_var

        # Normalize
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        # Scale and shift
        out = self.gamma * x_hat + self.beta
        return out

# ===== Manual 2D Batch Normalization (for CNNs) =====
class ManualBatchNorm2d(nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum

        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))

        self.register_buffer('running_mean', torch.zeros(1, num_features, 1, 1))
        self.register_buffer('running_var', torch.ones(1, num_features, 1, 1))

    def forward(self, x):
        # x shape: (N, C, H, W)
        if self.training:
            mean = x.mean(dim=(0, 2, 3), keepdim=True)
            var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean = self.running_mean
            var = self.running_var

        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

# ===== A complete network using BN =====
class BNNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x.view(x.size(0), -1))

# Try the manual implementation
manual_bn = ManualBatchNorm1d(256)
pytorch_bn = nn.BatchNorm1d(256)

x = torch.randn(32, 256)
manual_bn.train()
pytorch_bn.train()
print("Manual implementation output shape:", manual_bn(x).shape)
print("PyTorch implementation output shape:", pytorch_bn(x).shape)

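4.6 Pros and Cons of BN

Pros:

- Speeds up convergence
- Allows larger learning rates
- Reduces sensitivity to weight initialization
- Has a mild regularizing effect

Cons:

- Depends on batch size (works poorly with small batches)
- Behaves differently in training and inference
- Hard to use in RNNs
- Adds computation at inference time unless the BN layer is folded into the preceding linear or convolutional layer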

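5. Layer Normalization

5.1 Principle

Layer Normalization (Ba et al., 2016) was proposed to address BN's shortcomings. Where BN normalizes across the batch dimension, LN normalizes across the feature dimension of each individual sample.

For an input $x \in \mathbb{R}^{H}$ (where $H$ is the hidden dimension):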

\[\mu = \frac{1}{H} \sum_{i=1}^{H} x_i\]

\[\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2\]

\[\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}\]

\[y_i = \gamma_i \hat{x}_i + \beta_i\]
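The formulas translate almost line by line into tensor code. The sketch below assumes PyTorch is available and compares a hand-written version with `torch.nn.functional.layer_norm`; the function name `manual_layer_norm` and the tensor sizes are chosen for this illustration.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its last (feature) dimension, then scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # 1/H, as in the formula
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
x = torch.randn(4, 16)            # 4 samples, H = 16 features
gamma = torch.rand(16) + 0.5
beta = torch.randn(16)

ours = manual_layer_norm(x, gamma, beta)
ref = F.layer_norm(x, (16,), weight=gamma, bias=beta, eps=1e-5)
print("max difference:", (ours - ref).abs().max().item())

# The statistics are per sample, so the result for one sample
# does not depend on what else is in the batch.
alone = manual_layer_norm(x[:1], gamma, beta)
```

Because no statistic is shared across samples, LN behaves identically in training and inference and needs no running averages.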

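5.2 BN vs LN Comparison

Property	Batch Normalization	Layer Normalization
Normalized over	Batch dimension	Feature dimension
Depends on batch size	Yes	No
Same behavior in training and inference	No	Yes
Use in RNNs	Difficult	Natural fit
Use in Transformers	Uncommon	Standard
Needs moving averages	Yes	No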

Python

# Layer Normalization in PyTorch
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, nhead=8, dim_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.ReLU(),
            nn.Linear(dim_ff, d_model)
        )
        # Transformers use LayerNorm
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-Norm structure; x has shape (seq_len, batch, d_model)
        # because nn.MultiheadAttention defaults to batch_first=False
        normed = self.norm1(x)  # compute once and reuse as query, key and value
        attn_out, _ = self.self_attn(normed, normed, normed)
        x = x + self.dropout(attn_out)
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        return x

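5.3 Other Normalization Methods

Python

# Instance Normalization: normalizes each channel of each sample (common in style transfer)
instance_norm = nn.InstanceNorm2d(64)

# Group Normalization: splits channels into groups and normalizes within each group
group_norm = nn.GroupNorm(num_groups=32, num_channels=64)

# RMS Normalization: normalizes by the root mean square only (used in LLaMA)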
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

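6. Weight Initialization Strategies

6.1 Why Initialization Matters

Poor initialization fails in distinct ways:

- All-zero initialization: every neuron in a layer learns the same feature (the symmetry problem)
- Too large: activations and gradients explode
- Too small: activations shrink toward zero and gradients vanish

The goal is to keep the variance of activations and gradients stable from layer to layer.

6.2 Xavier Initialization (Glorot Initialization)

Suited to Sigmoid/Tanh activations.

For a layer with $n_{in}$ inputs and $n_{out}$ outputs:

Uniform version:

\[W \sim U\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right]\]

Normal version:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]

Key idea of the derivation: assume the activation is approximately linear near zero. Preserving $\text{Var}(y) = \text{Var}(x)$ in the forward pass requires $\text{Var}(W) = \frac{1}{n_{in}}$, while preserving the gradient variance in the backward pass requires $\text{Var}(W) = \frac{1}{n_{out}}$. Xavier initialization compromises with $\frac{2}{n_{in} + n_{out}}$.

6.3 He Initialization (Kaiming Initialization)

Suited to ReLU and its variants.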

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]

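Derivation: ReLU zeroes roughly half of the pre-activations, which halves the second moment passed on to the next layer. Preserving variance in the forward pass alone would require $\text{Var}(W) = \frac{1}{n_{in}}$; He initialization doubles this to $\frac{2}{n_{in}}$ to compensate.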
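The factor of 2 is easy to observe empirically. The sketch below assumes PyTorch is available and pushes standard-normal inputs through a stack of bias-free ReLU layers; the depth, width, batch size and the helper name `final_second_moment` are choices made for this illustration.

```python
import math
import torch

def final_second_moment(scale, depth=10, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with W ~ N(0, scale / width)."""
    torch.manual_seed(seed)   # same seed, so both runs use the same underlying random draws
    h = torch.randn(batch, width)
    for _ in range(depth):
        w = torch.randn(width, width) * math.sqrt(scale / width)
        h = torch.relu(h @ w)
    return h.pow(2).mean().item()

he = final_second_moment(scale=2.0)      # Var(W) = 2 / n_in  (He)
lecun = final_second_moment(scale=1.0)   # Var(W) = 1 / n_in  (no ReLU correction)

print(f"He: {he:.4f}   1/n_in: {lecun:.6f}   ratio: {he / lecun:.1f}")
```

With variance $\frac{2}{n_{in}}$ the signal stays at the same order of magnitude as the input, whereas with $\frac{1}{n_{in}}$ it is halved at every layer, so after 10 layers it is smaller by a factor of $2^{10} = 1024$.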

Python

import torch
import torch.nn as nn
import torch.nn.init as init

class InitializedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

        # Apply the initialization
        self._initialize_weights()

    def forward(self, x):
        # Forward pass so the model can be called, e.g. by check_activation_stats
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                # Xavier initialization (for Sigmoid/Tanh)
                # init.xavier_uniform_(m.weight)
                # init.xavier_normal_(m.weight)

                # Kaiming/He initialization (for ReLU)
                init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')

                if m.bias is not None:
                    init.zeros_(m.bias)

            elif isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    init.zeros_(m.bias)

            elif isinstance(m, nn.BatchNorm2d):
                init.ones_(m.weight)
                init.zeros_(m.bias)

# Check the effect of initialization
def check_activation_stats(model, x):
    """Print output statistics for each top-level child module."""
    activations = []
    hooks = []

    def hook_fn(module, input, output):
        activations.append(output.detach())  # detach from the graph; no gradient tracking

    for layer in model.children():
        hooks.append(layer.register_forward_hook(hook_fn))

    with torch.no_grad():
        model(x)

    for i, act in enumerate(activations):
        # dead_ratio (fraction of exact zeros) is only informative for post-ReLU outputs
        print(f"Layer {i}: mean={act.mean():.4f}, std={act.std():.4f}, "
              f"dead_ratio={(act == 0).float().mean():.4f}")

    for h in hooks:
        h.remove()

6.4 Choosing an Initialization Method

Activation	Recommended initialization	PyTorch function
Sigmoid / Tanh	Xavier	`init.xavier_normal_`
ReLU	Kaiming (fan_in)	`init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')`
Leaky ReLU	Kaiming (slope-adjusted)	`init.kaiming_normal_(w, a=0.01)`
SELU	LeCun	`init.normal_(w, std=1/sqrt(n))`, with $n = n_{in}$
Transformer	Small normal	`init.normal_(w, std=0.02)`

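7. Data Augmentation

7.1 Overview

Data augmentation applies transformations to the training data to "enlarge" the dataset, and is one of the most effective regularization methods.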

Python

import torchvision.transforms as transforms

# ===== Basic data augmentation =====
basic_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ===== Advanced data augmentation =====
advanced_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)
    ], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([
        transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))
    ], p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),
])

7.2 Mixup and CutMix

Python

import numpy as np
import torch

def mixup_data(x, y, alpha=1.0):
    """Mixup: blend each sample with a randomly chosen partner from the same batch."""
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size(0)
    index = torch.randperm(batch_size).to(x.device)

    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    """Mixup loss: weighted sum of the losses against both label sets."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

def cutmix_data(x, y, alpha=1.0):
    """CutMix: paste a random rectangle from a partner image into each sample."""
    lam = np.random.beta(alpha, alpha)
    batch_size = x.size(0)
    index = torch.randperm(batch_size).to(x.device)

    # Sample a random box; x has shape (N, C, H, W)
    H, W = x.size(2), x.size(3)
    cut_ratio = np.sqrt(1 - lam)
    cut_h = int(H * cut_ratio)
    cut_w = int(W * cut_ratio)

    cy = np.random.randint(H)
    cx = np.random.randint(W)

    y1 = np.clip(cy - cut_h // 2, 0, H)
    y2 = np.clip(cy + cut_h // 2, 0, H)
    x1 = np.clip(cx - cut_w // 2, 0, W)
    x2 = np.clip(cx + cut_w // 2, 0, W)

    mixed_x = x.clone()
    mixed_x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]

    # Recompute lambda from the actual (clipped) box area
    lam = 1 - ((x2 - x1) * (y2 - y1) / (W * H))
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

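8. Early Stopping

8.1 Principle

Early stopping monitors validation performance to decide when to stop training. Training stops when the validation metric has not improved for a set number of consecutive epochs.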

Python

import copy

class EarlyStopping:
    """Early stopping helper that also keeps the best model weights."""
    def __init__(self, patience=10, min_delta=0.0, mode='min', verbose=True):
        """
        Args:
            patience: number of epochs without improvement before stopping
            min_delta: minimum change that counts as an improvement
            mode: 'min' if smaller is better (e.g. loss), 'max' if larger is better (e.g. accuracy)
            verbose: whether to print progress messages
        """
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.best_model_state = None

    def __call__(self, score, model):  # __call__ lets the instance be used like a function
        if self.best_score is None:
            self.best_score = score
            self.best_model_state = copy.deepcopy(model.state_dict())
        elif self._is_improvement(score):
            self.best_score = score
            self.best_model_state = copy.deepcopy(model.state_dict())
            self.counter = 0
            if self.verbose:
                print(f"EarlyStopping: metric improved to {score:.6f}")
        else:
            self.counter += 1
            if self.verbose:
                print(f"EarlyStopping: {self.counter}/{self.patience}")
            if self.counter >= self.patience:
                self.early_stop = True

    def _is_improvement(self, score):
        if self.mode == 'min':
            return score < self.best_score - self.min_delta
        else:
            return score > self.best_score + self.min_delta

    def load_best_model(self, model):
        model.load_state_dict(self.best_model_state)

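# ===== Usage example =====
# max_epochs, train_one_epoch, evaluate, model, the loaders, optimizer and
# criterion are placeholders defined elsewhere.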
early_stopping = EarlyStopping(patience=10, mode='min')

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion)
    val_loss = evaluate(model, val_loader, criterion)

    early_stopping(val_loss, model)
    if early_stopping.early_stop:
        print(f"Early stopping at epoch {epoch}")
        break

# Restore the best model
early_stopping.load_best_model(model)

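9. Gradient Clipping

9.1 Background

When training deep networks (RNNs in particular), gradients can become very large (exploding gradients) and destabilize training. Gradient clipping addresses this by limiting the size of the gradients.

9.2 Two Ways to Clip Gradients

Clip by value:

\[g_i \leftarrow \text{clip}(g_i, -\text{threshold}, \text{threshold})\]

Clip by norm:

\[\mathbf{g} \leftarrow \frac{\text{threshold}}{\|\mathbf{g}\|} \cdot \mathbf{g} \quad \text{if } \|\mathbf{g}\| > \text{threshold}\]

Clipping by norm is more common because it leaves the gradient direction unchanged.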
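The difference between the two rules is visible on a two-component gradient. The sketch below is plain Python; the gradient $(3, 4)$ and the threshold of 1 are illustrative assumptions.

```python
import math

def clip_by_value(g, threshold):
    """Clamp each component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, gi)) for gi in g]

def clip_by_norm(g, threshold):
    """Rescale the whole vector if its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= threshold:
        return list(g)
    scale = threshold / norm
    return [gi * scale for gi in g]

g = [3.0, 4.0]                      # L2 norm is 5
by_value = clip_by_value(g, 1.0)    # both components saturate at the threshold
by_norm = clip_by_norm(g, 1.0)      # vector shrunk to unit length

print(by_value, by_norm)
```

Clipping by value turns the 3:4 gradient into a 1:1 direction, while clipping by norm keeps the 3:4 ratio and only shortens the step.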

Python

import torch.nn.utils as utils

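# ===== Clip by norm (recommended) =====
# loss, model and optimizer come from the surrounding training loop
max_grad_norm = 1.0

loss.backward()
# Clip the gradients of all parameters; the return value is the total norm before clipping
total_norm = utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
print(f"Gradient norm: {total_norm:.4f}")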
optimizer.step()

# ===== Clip by value =====
max_grad_value = 0.5

loss.backward()
utils.clip_grad_value_(model.parameters(), clip_value=max_grad_value)
optimizer.step()

# ===== Monitor gradient norms =====
def monitor_gradients(model):
    """Print the gradient norm of each parameter and return the total norm."""
    total_norm = 0
    for name, param in model.named_parameters():
        if param.grad is not None:
            param_norm = param.grad.detach().norm(2)
            total_norm += param_norm.item() ** 2  # .item() converts a one-element tensor to a Python number
            print(f"{name}: grad_norm = {param_norm:.6f}")
    total_norm = total_norm ** 0.5
    print(f"Total gradient norm: {total_norm:.6f}")
    return total_norm

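10. Exercises and Self-Check

Exercises

1. L1/L2 regularization: train a fully connected network on MNIST with no regularization, with L1, and with L2, then compare test accuracy and weight distributions.
2. Dropout experiment: implement a network with Dropout, try Dropout rates of 0.1, 0.3, 0.5 and 0.7, and observe how the degree of overfitting changes.
3. Manual BN: complete the backward-pass code for BatchNorm and compare the gradients with PyTorch's built-in implementation.
4. Weight initialization: train the same network with all-zero, random uniform, Xavier and Kaiming initialization, and record the mean and standard deviation of each layer's activations during training.
5. Early Stopping: implement a complete training pipeline with Early Stopping and model saving/loading.
6. Combined experiment: build a CNN on CIFAR-10 that combines BN, Dropout, data augmentation, weight decay and gradient clipping, and compare it with an unregularized baseline.

Self-Check List

- Can explain the causes of overfitting and how to recognize it
- Understands the difference between L1 and L2 regularization and the geometric intuition for each
- Can implement the training and inference logic of Dropout by hand
- Understands the forward and backward passes of Batch Normalization
- Can tell apart the dimensions normalized by BN, LN, IN and GN
- Knows the reasoning behind the Xavier and He initialization derivations
- Can implement Early Stopping and gradient clipping
- Understands the principles and code of Mixup and CutMix
- Can choose and combine regularization techniques in a real project

Next: 07 - Advanced Optimizers, a deeper look at optimization algorithms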
