03 - CNN in Practice: Training Techniques
📌 Section positioning: this document belongs to the deep learning tutorial series and focuses on the principles and general methodology of CNN training techniques.
- Focus of this document: the theoretical basis of transfer learning, the mathematics behind data augmentation, the rationale for model ensembling, and the algorithm behind CAM visualization
- For hands-on applications: for complete project walkthroughs of CNNs on specific CV tasks (image classification, object detection, image segmentation, etc.), see the 计算机视觉/实战项目/ directory

Estimated time: about 6-8 hours
Difficulty: ⭐⭐⭐⭐ upper-intermediate
Prerequisites: CNN fundamentals, classic CNN architectures, PyTorch
Learning goals: master practical techniques including transfer learning, advanced data augmentation, model ensembling, and CAM visualization
1. Transfer Learning

1.1 What Is Transfer Learning

Transfer learning applies a model pretrained on a large-scale dataset (such as ImageNet) to a new task. The core assumption: low-level features (edges, textures) are generic and can be reused across tasks.

1.2 Two Transfer Learning Strategies

Feature Extraction

Freeze the pretrained model's convolutional layers and train only a new classification head:
```python
import torch
import torch.nn as nn
import torchvision.models as models

def create_feature_extractor(num_classes=10):
    """Feature extraction mode: freeze the backbone, train only the head"""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    # Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False
    # Replace the classification head (these parameters will be trained)
    num_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, num_classes)
    )
    return model

# Optimize only the head's parameters
model = create_feature_extractor(num_classes=10)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```
Fine-tuning

Train the classification head first, then progressively unfreeze pretrained layers:
```python
def create_finetuning_model(num_classes=10):
    """Fine-tuning mode: staged unfreezing"""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    # Replace the classification head
    num_features = model.fc.in_features
    model.fc = nn.Linear(num_features, num_classes)
    return model

def finetune_training(model, train_loader, val_loader, device):
    """Staged fine-tuning"""
    # === Stage 1: freeze the backbone, train the head ===
    for param in model.parameters():
        param.requires_grad = False
    for param in model.fc.parameters():
        param.requires_grad = True
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    print("Stage 1: training the classification head...")
    for epoch in range(5):
        train_one_epoch(model, train_loader, optimizer, device)

    # === Stage 2: unfreeze layer4 and fine-tune ===
    for param in model.layer4.parameters():
        param.requires_grad = True
    optimizer = torch.optim.Adam([
        {'params': model.layer4.parameters(), 'lr': 1e-4},
        {'params': model.fc.parameters(), 'lr': 1e-3},
    ])
    print("Stage 2: fine-tuning layer4 + head...")
    for epoch in range(10):
        train_one_epoch(model, train_loader, optimizer, device)

    # === Stage 3: unfreeze everything, fine-tune with low learning rates ===
    for param in model.parameters():
        param.requires_grad = True
    optimizer = torch.optim.Adam([
        {'params': model.conv1.parameters(), 'lr': 1e-6},
        {'params': model.bn1.parameters(), 'lr': 1e-6},  # the stem's BN must be in a group too
        {'params': model.layer1.parameters(), 'lr': 1e-5},
        {'params': model.layer2.parameters(), 'lr': 1e-5},
        {'params': model.layer3.parameters(), 'lr': 5e-5},
        {'params': model.layer4.parameters(), 'lr': 1e-4},
        {'params': model.fc.parameters(), 'lr': 5e-4},
    ])
    print("Stage 3: fine-tuning all layers...")
    for epoch in range(15):
        train_one_epoch(model, train_loader, optimizer, device)

def train_one_epoch(model, loader, optimizer, device):
    model.train()  # enable training mode
    criterion = nn.CrossEntropyLoss()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)  # move data to GPU/CPU
        optimizer.zero_grad()  # clear gradients to prevent accumulation
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()  # backpropagate to compute gradients
        optimizer.step()  # update parameters from the gradients
```
1.3 When to Use Which Strategy

| Condition | Recommended strategy |
|---|---|
| Small dataset + similar to ImageNet | Feature extraction |
| Small dataset + different from ImageNet | Feature extraction (results may be mediocre) |
| Large dataset + similar to ImageNet | Fine-tuning |
| Large dataset + different from ImageNet | Fine-tuning (may require unfreezing deeper layers) |
2. Advanced Data Augmentation

2.1 Mixup

Zhang et al. (2018): linearly interpolate between two samples:

\[\tilde{x} = \lambda x_i + (1-\lambda) x_j\]
\[\tilde{y} = \lambda y_i + (1-\lambda) y_j\]

where \(\lambda \sim \text{Beta}(\alpha, \alpha)\) and \(\alpha\) is typically 0.2-0.4.
```python
import numpy as np
import torch

class MixupTrainer:
    def __init__(self, alpha=0.2):
        self.alpha = alpha

    def mixup_data(self, x, y):
        lam = np.random.beta(self.alpha, self.alpha) if self.alpha > 0 else 1.0
        batch_size = x.size(0)
        index = torch.randperm(batch_size, device=x.device)
        mixed_x = lam * x + (1 - lam) * x[index]
        return mixed_x, y, y[index], lam

    def mixup_criterion(self, criterion, pred, y_a, y_b, lam):
        return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

    def train_step(self, model, inputs, targets, optimizer, criterion):
        mixed_inputs, targets_a, targets_b, lam = self.mixup_data(inputs, targets)
        outputs = model(mixed_inputs)
        loss = self.mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()  # .item() converts a one-element tensor to a Python number
```
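Why the two-term loss works: cross-entropy is linear in the target distribution, so the weighted sum of the two losses is exactly the cross-entropy against the interpolated soft label. A quick standalone check with arbitrary logits and label pairs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
lam = 0.7
logits = torch.randn(4, 10)
y_a = torch.tensor([1, 2, 3, 4])
y_b = torch.tensor([5, 6, 7, 8])

# The two-term mixup loss...
loss_two_term = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

# ...equals cross-entropy against the interpolated soft labels,
# because cross-entropy is linear in the target distribution.
soft = lam * F.one_hot(y_a, 10).float() + (1 - lam) * F.one_hot(y_b, 10).float()
loss_soft = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(loss_two_term.item(), loss_soft.item())
```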
2.2 CutMix

Yun et al. (2019): cut a region from one image and paste it onto another:
```python
def cutmix_data(x, y, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    batch_size = x.size(0)
    index = torch.randperm(batch_size, device=x.device)
    _, _, H, W = x.shape
    cut_ratio = np.sqrt(1 - lam)
    rH, rW = int(H * cut_ratio), int(W * cut_ratio)
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, y1 = max(cx - rW // 2, 0), max(cy - rH // 2, 0)
    x2, y2 = min(cx + rW // 2, W), min(cy + rH // 2, H)
    mixed = x.clone()
    mixed[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so it matches the actual pasted area
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)
    return mixed, y, y[index], lam
```
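Recomputing lam after the box is clipped at the image border keeps the label weight consistent with the pixels. A standalone numeric check with a random box on a 32×32 pair (all-zeros vs all-ones images, so the pasted patch is directly measurable):

```python
import numpy as np
import torch

np.random.seed(0)
x = torch.zeros(2, 3, 32, 32)
x[1] = 1.0  # second image is all ones, so pasted pixels are countable

H, W = 32, 32
lam0 = np.random.beta(1.0, 1.0)
cut = np.sqrt(1 - lam0)
rH, rW = int(H * cut), int(W * cut)
cx, cy = np.random.randint(W), np.random.randint(H)
x1, y1 = max(cx - rW // 2, 0), max(cy - rH // 2, 0)
x2, y2 = min(cx + rW // 2, W), min(cy + rH // 2, H)

mixed = x.clone()
mixed[0, :, y1:y2, x1:x2] = x[1, :, y1:y2, x1:x2]  # paste image 1's patch into image 0

# lam recomputed from the clipped box...
lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)
# ...equals the fraction of pixels kept from the first image
frac_ones = mixed[0].mean().item()
print(lam, 1 - frac_ones)
```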
2.3 AutoAugment & RandAugment

```python
import torchvision.transforms as T

# AutoAugment (policies found by automated search)
auto_augment = T.Compose([
    T.AutoAugment(policy=T.AutoAugmentPolicy.CIFAR10),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# RandAugment (a simpler, more efficient alternative)
rand_augment = T.Compose([
    T.RandAugment(num_ops=2, magnitude=9),  # apply 2 random ops at magnitude 9
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# TrivialAugment (the simplest scheme, and it works well too)
trivial_augment = T.Compose([
    T.TrivialAugmentWide(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
```
3. Model Ensembling

3.1 Ensemble Methods
```python
class ModelEnsemble(nn.Module):
    """Model ensemble"""
    def __init__(self, models, method='average'):
        super().__init__()
        self.models = nn.ModuleList(models)
        self.method = method

    def forward(self, x):
        outputs = [model(x) for model in self.models]
        if self.method == 'average':
            return torch.stack(outputs).mean(dim=0)
        elif self.method == 'vote':
            preds = [o.argmax(dim=1) for o in outputs]
            stacked = torch.stack(preds, dim=0)  # (num_models, batch)
            return torch.mode(stacked, dim=0).values
        elif self.method == 'weighted':
            # Fixed weights: one entry per model, summing to 1
            weights = torch.tensor([0.4, 0.3, 0.3], device=x.device)
            return sum(w * o for w, o in zip(weights, outputs))

# TTA (test-time augmentation)
def test_time_augmentation(model, image, n_augments=10):
    """Test-time augmentation: average predictions over several transformed copies of one image"""
    tta_transforms = T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomCrop(32, padding=4),
    ])
    model.eval()  # evaluation mode (disables Dropout, etc.)
    predictions = []
    with torch.no_grad():  # disable gradient tracking to save memory
        # Original image
        predictions.append(torch.softmax(model(image.unsqueeze(0)), dim=1))
        # Augmented copies
        for _ in range(n_augments - 1):
            augmented = tta_transforms(image)
            pred = torch.softmax(model(augmented.unsqueeze(0)), dim=1)
            predictions.append(pred)
    return torch.stack(predictions).mean(dim=0)
```
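The three combination rules can be sketched on fake per-model logits, assuming 3 models, a batch of 4, and 5 classes:

```python
import torch

torch.manual_seed(0)
# Fake per-model logits: (num_models, batch, num_classes)
outputs = torch.randn(3, 4, 5)

# 'average': mean of logits, then argmax
avg_pred = outputs.mean(dim=0).argmax(dim=1)

# 'vote': majority class across models
votes = outputs.argmax(dim=2)               # (3, 4)
vote_pred = torch.mode(votes, dim=0).values

# 'weighted': fixed per-model weights (must match num_models, sum to 1)
w = torch.tensor([0.4, 0.3, 0.3]).view(3, 1, 1)
weighted_pred = (w * outputs).sum(dim=0).argmax(dim=1)

print(avg_pred.tolist(), vote_pred.tolist(), weighted_pred.tolist())
```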
4. CAM Visualization

4.1 Grad-CAM

Selvaraju et al. (2017): use gradient information to build a class activation map that visualizes the regions the model attends to.
```python
import torch
import torch.nn.functional as F
import numpy as np

class GradCAM:
    """Grad-CAM visualization"""
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None
        # Register hooks on the target layer
        target_layer.register_forward_hook(self._forward_hook)
        target_layer.register_full_backward_hook(self._backward_hook)

    def _forward_hook(self, module, input, output):
        self.activations = output.detach()  # detach from the graph; no gradients flow through this copy

    def _backward_hook(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def generate(self, input_image, target_class=None):
        """Generate the Grad-CAM heatmap"""
        self.model.eval()
        output = self.model(input_image)
        if target_class is None:
            target_class = output.argmax(dim=1).item()
        # Backpropagate the score of the target class
        self.model.zero_grad()
        one_hot = torch.zeros_like(output)
        one_hot[0, target_class] = 1
        output.backward(gradient=one_hot)
        # Weights: global average pooling over the gradients
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)  # (1, C, 1, 1)
        # Weighted sum of the activations
        cam = (weights * self.activations).sum(dim=1, keepdim=True)  # (1, 1, H, W)
        cam = F.relu(cam)  # keep only positive contributions
        # Normalize to [0, 1]
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)
        # Upsample to the input size
        cam = F.interpolate(cam, size=input_image.shape[2:], mode='bilinear', align_corners=False)
        return cam.squeeze().cpu().numpy()  # squeeze removes size-1 dimensions

    def visualize(self, image, cam, alpha=0.5):
        """Overlay visualization"""
        import matplotlib.pyplot as plt
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        # Original image
        img = image.squeeze().permute(1, 2, 0).cpu().numpy()
        img = (img - img.min()) / (img.max() - img.min())
        axes[0].imshow(img)
        axes[0].set_title('Original')
        axes[0].axis('off')
        # CAM heatmap
        axes[1].imshow(cam, cmap='jet')
        axes[1].set_title('Grad-CAM')
        axes[1].axis('off')
        # Overlay
        axes[2].imshow(img)
        axes[2].imshow(cam, cmap='jet', alpha=alpha)
        axes[2].set_title('Overlay')
        axes[2].axis('off')
        plt.tight_layout()
        plt.show()

# Usage example
# model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
# gradcam = GradCAM(model, model.layer4[-1])
# image = torch.randn(1, 3, 224, 224)
# cam = gradcam.generate(image)
# gradcam.visualize(image, cam)
```
5. Training Tricks Summary

5.1 Checklist of Training Tricks

| Trick | Notes | Typical settings |
|---|---|---|
| Learning rate | Warmup + cosine schedule | initial 0.1 (SGD) / 1e-3 (Adam) |
| Optimizer | SGD + momentum, or AdamW | momentum=0.9, wd=5e-4 |
| Data augmentation | RandAugment + Mixup/CutMix | magnitude=9, alpha=0.2 |
| Regularization | Dropout + weight decay + label smoothing | 0.1-0.3, 5e-4, 0.1 |
| Normalization | Batch Norm (CNNs) | defaults |
| Initialization | Kaiming normal | mode='fan_out' |
| Gradient clipping | Clip by norm | max_norm=1.0 |
| EMA | Exponential moving average | decay=0.9999 |
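The "Warmup + Cosine" row can be implemented with PyTorch's built-in schedulers. A minimal sketch, assuming 5 warmup epochs out of 100 (`SequentialLR` chains a `LinearLR` warmup into `CosineAnnealingLR`):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

warmup_epochs, total_epochs = 5, 100
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),  # linear warmup
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),   # cosine decay
    ],
    milestones=[warmup_epochs],
)

lrs = []
for _ in range(total_epochs):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()     # training step would go here
    scheduler.step()

print(f"start {lrs[0]:.4f}, peak {max(lrs):.4f}, end {lrs[-1]:.6f}")
```

The recorded learning rate ramps up from 1% of the base rate, peaks at the base rate after warmup, then decays toward zero.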
5.2 Label Smoothing

```python
import torch.nn.functional as F

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, pred, target):
        log_prob = F.log_softmax(pred, dim=-1)
        # Log-probability of the true label
        nll_loss = -log_prob.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
        # Log-probability under the uniform distribution
        smooth_loss = -log_prob.mean(dim=-1)
        loss = (1 - self.smoothing) * nll_loss + self.smoothing * smooth_loss
        return loss.mean()

# Or use the PyTorch built-in directly
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```
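A quick numeric check that the manual formula matches PyTorch's built-in `label_smoothing`: both use the target distribution (1-ε)·one-hot + (ε/K)·uniform, so the losses agree exactly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
smoothing = 0.1
logits = torch.randn(4, 10)
target = torch.tensor([1, 3, 5, 7])

# Manual smoothed loss, same formula as the class above
log_prob = F.log_softmax(logits, dim=-1)
nll = -log_prob.gather(-1, target.unsqueeze(1)).squeeze(1)
smooth = -log_prob.mean(dim=-1)
manual = ((1 - smoothing) * nll + smoothing * smooth).mean()

# PyTorch built-in
builtin = F.cross_entropy(logits, target, label_smoothing=smoothing)
print(manual.item(), builtin.item())
```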
5.3 EMA (Exponential Moving Average)

```python
class EMA:
    """Exponential Moving Average of model weights"""
    def __init__(self, model, decay=0.9999):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}
        for name, param in model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = self.decay * self.shadow[name] + (1 - self.decay) * param.data

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                param.data = self.backup[name]
        self.backup = {}
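A toy check of the update rule on a single scalar: the shadow value follows the raw parameter with a lag and much smaller step-to-step variance, which is why evaluating with EMA weights tends to be more stable.

```python
import torch

torch.manual_seed(0)
decay = 0.9  # small decay so the effect shows quickly
param = torch.tensor(0.0)
shadow = param.clone()

history = []
for step in range(200):
    param = param + 0.1 + 0.5 * torch.randn(())   # noisy upward drift, like SGD updates
    shadow = decay * shadow + (1 - decay) * param  # the EMA.update() rule
    history.append((param.item(), shadow.item()))

# Step-to-step variance: the shadow trajectory is much smoother than the raw one
raw_var = torch.tensor([h[0] for h in history[100:]]).diff().var().item()
ema_var = torch.tensor([h[1] for h in history[100:]]).diff().var().item()
print(f"variance of raw steps: {raw_var:.4f}, EMA steps: {ema_var:.4f}")
```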
6. A Complete Image Classification Project

```python
"""
A complete CIFAR-10 image classification project.
Combines: data augmentation + Mixup/CutMix + label smoothing + cosine LR schedule + EMA.
Relies on cutmix_data (section 2.2) and the EMA class (section 5.3) defined earlier.
"""
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T
import torchvision.models as models

# ===== Configuration =====
class Config:
    # Data
    batch_size = 128
    num_workers = 4
    num_classes = 10
    # Training
    epochs = 100
    lr = 0.1
    weight_decay = 5e-4
    momentum = 0.9
    label_smoothing = 0.1
    # Augmentation
    mixup_alpha = 0.2
    cutmix_alpha = 1.0
    mixup_prob = 0.5  # probability of picking Mixup over CutMix
    # EMA
    ema_decay = 0.9999
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

config = Config()

# ===== Data =====
transform_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    T.RandomErasing(p=0.25),
])
transform_test = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = torchvision.datasets.CIFAR10('./data', train=True, download=True, transform=transform_train)
testset = torchvision.datasets.CIFAR10('./data', train=False, download=True, transform=transform_test)
# DataLoader batches the data and supports shuffling and multi-process loading
train_loader = DataLoader(trainset, batch_size=config.batch_size, shuffle=True, num_workers=config.num_workers)
test_loader = DataLoader(testset, batch_size=config.batch_size, shuffle=False, num_workers=config.num_workers)

# ===== Model =====
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)  # adapt to 32x32 inputs
model.maxpool = nn.Identity()  # remove the max pool
model.fc = nn.Linear(512, config.num_classes)
model = model.to(config.device)

# ===== Training components =====
def mixup_data(x, y, alpha=0.2):
    """Standalone Mixup helper (same logic as MixupTrainer.mixup_data in section 2.1)"""
    lam = np.random.beta(alpha, alpha) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0), device=x.device)
    return lam * x + (1 - lam) * x[index], y, y[index], lam

criterion = nn.CrossEntropyLoss(label_smoothing=config.label_smoothing)
optimizer = optim.SGD(model.parameters(), lr=config.lr, momentum=config.momentum, weight_decay=config.weight_decay)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=config.epochs)
ema = EMA(model, decay=config.ema_decay)

# ===== Training loop =====
def train_epoch(model, loader, optimizer, criterion, config):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(config.device), targets.to(config.device)
        # Randomly choose Mixup or CutMix
        if np.random.random() < config.mixup_prob:
            inputs, targets_a, targets_b, lam = mixup_data(inputs, targets, config.mixup_alpha)
        else:
            inputs, targets_a, targets_b, lam = cutmix_data(inputs, targets, config.cutmix_alpha)
        outputs = model(inputs)
        loss = lam * criterion(outputs, targets_a) + (1 - lam) * criterion(outputs, targets_b)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        ema.update()
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += (lam * predicted.eq(targets_a).float() + (1 - lam) * predicted.eq(targets_b).float()).sum().item()
    return total_loss / total, 100. * correct / total

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct, total = 0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    return 100. * correct / total

# ===== Main training loop =====
best_acc = 0
for epoch in range(config.epochs):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion, config)
    # Evaluate with the EMA weights
    ema.apply_shadow()
    test_acc = evaluate(model, test_loader, config.device)
    if test_acc > best_acc:
        best_acc = test_acc
        torch.save(model.state_dict(), 'best_model.pth')  # saves the EMA weights
    ema.restore()
    scheduler.step()
    lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1}/{config.epochs} | LR: {lr:.6f} | "
          f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | Test Acc: {test_acc:.2f}%")

print(f"\nBest test accuracy: {best_acc:.2f}%")
```
7. Exercises and Self-Check

Exercises

- Transfer learning: fine-tune a pretrained ResNet-50 on the Oxford Flowers 102 dataset and compare feature extraction against fine-tuning.
- Augmentation comparison: on CIFAR-10, compare no augmentation, basic augmentation, RandAugment, Mixup, and CutMix.
- Grad-CAM: generate Grad-CAM visualizations for a trained model and check whether the regions it attends to are reasonable.
- Model ensembling: train 3 models with different architectures and compare single-model vs ensemble accuracy.
- Full project: run this chapter's complete project code and try to reach 95%+ accuracy on CIFAR-10.

Self-Check Checklist

- Can distinguish the two transfer learning modes, feature extraction and fine-tuning
- Understand why staged fine-tuning helps and how to implement it
- Know the principles behind Mixup and CutMix and how to implement them
- Know how to use RandAugment and AutoAugment
- Can implement Grad-CAM and interpret its visualizations
- Understand what Label Smoothing and EMA do
- Can write a complete image classification training pipeline

