01-卷积神经网络基础¶
📌 章节定位:本文档隶属于深度学习教程体系,侧重CNN的数学原理与理论推导。

- 本文档重点:卷积运算的数学公式、参数计算推导、感受野计算、各种卷积变体原理
- 应用实践方向:如需了解CNN在图像分类、目标检测等CV任务中的实际应用、预训练模型使用、迁移学习等内容,请参考 计算机视觉/05-卷积神经网络基础.md
- 学习时间: 约 6-8 小时
- 难度级别: ⭐⭐⭐ 中级
- 前置知识: 神经网络基础、反向传播算法、PyTorch 基础
- 学习目标: 深入理解卷积运算原理,掌握CNN核心组件及PyTorch实现
目录¶
- 1. 卷积运算详解
- 2. 卷积层参数计算
- 3. 填充与步长
- 4. 池化层
- 5. 感受野计算
- 6. 转置卷积(反卷积)
- 7. 深度可分离卷积
- 8. 空洞卷积
- 9. 完整CNN结构示例
- 10. 练习与自我检查
1. 卷积运算详解¶
1.1 从全连接到卷积¶
全连接层处理图像的问题:

- 参数爆炸: 一张 \(224 \times 224 \times 3\) 的图像展平后有 150,528 个输入,若隐藏层有 1000 个神经元,需要约 1.5 亿个参数
- 忽略空间结构: 将图像展平丢失了像素间的空间关系
- 无平移不变性: 同一物体在不同位置需要重新学习
卷积层通过三个关键特性解决这些问题:

1. 局部连接(Local Connectivity): 每个神经元只连接输入的一个小区域
2. 参数共享(Parameter Sharing): 同一卷积核在所有位置共享参数
3. 平移等变性(Translation Equivariance): 输入平移 → 输出平移
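下面用几行纯 Python 直观感受一下参数量差距。数字仅作示意:假设隐藏层有 1000 个神经元,对比 1000 个 \(3 \times 3\) 卷积核(此对比配置为本节的假设,非某个具体网络):

```python
# 全连接:224x224x3 输入展平后接 1000 个神经元(不含偏置)
fc_params = 224 * 224 * 3 * 1000          # 150,528,000 ≈ 1.5 亿

# 卷积:1000 个 3x3x3 卷积核(含偏置),参数量与输入尺寸无关
conv_params = 1000 * (3 * 3 * 3 + 1)      # 28,000

print(f"全连接: {fc_params:,}, 卷积: {conv_params:,}")
```

可以看到,参数共享使卷积层的参数量与输入分辨率无关,相差数千倍。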
1.2 一维卷积(1D Convolution)¶
一维卷积常用于序列数据(时间序列、文本等)。
对于输入信号 \(x\) 和核 \(w\),离散卷积定义为:

\[(x * w)(i) = \sum_{j} x(i - j)\, w(j)\]

注:在深度学习中通常使用互相关(Cross-correlation)\((x \star w)(i) = \sum_{j} x(i + j)\, w(j)\),而非数学意义上的卷积(后者需翻转核),但习惯上仍称之为"卷积"。
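为说明两者的区别,下面用纯 Python 对同一信号分别做互相关和翻转核后的数学卷积,仅作示意:

```python
def cross_correlate(x, w):
    """互相关:核不翻转,逐位置做内积"""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

x = [1.0, 2, 3, 4, 5, 6, 7, 8]
w = [1.0, 0, -1]

corr = cross_correlate(x, w)        # 互相关:x[i] - x[i+2],每处为 -2
conv = cross_correlate(x, w[::-1])  # 翻转核即得数学卷积:每处为 2
print(corr, conv)
```

由于卷积核本身是可学习的,翻转与否不影响网络的表达能力,这也是深度学习框架直接使用互相关的原因。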
import torch
import torch.nn as nn
import torch.nn.functional as F
# ===== 手动实现 1D 卷积 =====
def conv1d_manual(input_signal, kernel):
    """手动实现1D卷积(互相关)"""
    input_len = len(input_signal)
    kernel_len = len(kernel)
    output_len = input_len - kernel_len + 1
    output = torch.zeros(output_len)
    for i in range(output_len):
        output[i] = torch.sum(input_signal[i:i+kernel_len] * kernel)
    return output
# 测试
signal = torch.tensor([1.0, 2, 3, 4, 5, 6, 7, 8])
kernel = torch.tensor([1.0, 0, -1])
print("手动1D卷积:", conv1d_manual(signal, kernel))
# PyTorch 1D 卷积
# 输入: (batch_size, in_channels, length)
conv1d = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
x = torch.randn(2, 1, 100) # batch=2, channels=1, length=100
output = conv1d(x)
print(f"Conv1d 输出 shape: {output.shape}") # (2, 4, 100)
1.3 二维卷积(2D Convolution)¶
二维卷积是图像处理中最核心的操作。
对于输入特征图 \(X \in \mathbb{R}^{H \times W}\) 和卷积核 \(K \in \mathbb{R}^{k_h \times k_w}\):

\[Y(i, j) = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X(i+m,\, j+n)\, K(m, n)\]
对于多通道输入 \(X \in \mathbb{R}^{C_{in} \times H \times W}\) 和多通道输出,卷积核 \(K \in \mathbb{R}^{C_{out} \times C_{in} \times k_h \times k_w}\):

\[Y_{c}(i, j) = \sum_{c'=0}^{C_{in}-1} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X_{c'}(i+m,\, j+n)\, K_{c,\, c'}(m, n) + b_c\]
# ===== 手动实现 2D 卷积 =====
def conv2d_manual(input_tensor, kernel, bias=None):
    """
    手动实现2D卷积
    input_tensor: (C_in, H, W)
    kernel: (C_out, C_in, kH, kW)
    """
    C_out, C_in, kH, kW = kernel.shape
    _, H, W = input_tensor.shape
    out_H = H - kH + 1
    out_W = W - kW + 1
    output = torch.zeros(C_out, out_H, out_W)
    for co in range(C_out):
        for i in range(out_H):
            for j in range(out_W):
                # 提取局部区域,与对应核做元素乘积求和
                receptive_field = input_tensor[:, i:i+kH, j:j+kW]
                output[co, i, j] = torch.sum(receptive_field * kernel[co])
        if bias is not None:
            output[co] += bias[co]
    return output
# 测试
x = torch.randn(3, 8, 8) # 3通道, 8x8
k = torch.randn(16, 3, 3, 3) # 16个3x3卷积核
out = conv2d_manual(x, k)
print(f"手动2D卷积输出 shape: {out.shape}") # (16, 6, 6)
# PyTorch 2D 卷积
# 输入: (batch_size, C_in, H, W)
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(4, 3, 32, 32) # batch=4, 3通道, 32x32
output = conv2d(x)
print(f"Conv2d 输出 shape: {output.shape}") # (4, 16, 32, 32)
1.4 三维卷积(3D Convolution)¶
三维卷积用于视频数据(时间 + 空间)或体积数据(如医学3D扫描):
# 3D 卷积
# 输入: (batch_size, C_in, D, H, W) — D 是深度/时间维
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
x = torch.randn(2, 3, 16, 112, 112) # batch=2, 3通道, 16帧, 112x112
output = conv3d(x)
print(f"Conv3d 输出 shape: {output.shape}") # (2, 64, 16, 112, 112)
2. 卷积层参数计算¶
2.1 输出尺寸公式¶
对于输入尺寸 \(H_{in}\),卷积核大小 \(k\),填充 \(p\),步长 \(s\):

\[H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1\]
2.2 参数数量¶
\[\text{Params} = C_{out} \times (C_{in} \times k_h \times k_w + 1)\]

最后的 \(+1\) 是偏置项(如果有的话)。
2.3 计算量(FLOPs)¶
每个输出位置的乘加次数为 \(C_{in} \times k_h \times k_w\),总计算量:

\[\text{FLOPs} = 2 \times C_{out} \times H_{out} \times W_{out} \times C_{in} \times k_h \times k_w\]

因子 2 是因为每次乘法后跟一次加法。
def calc_conv_params(in_channels, out_channels, kernel_size,
                     input_size, padding=0, stride=1, bias=True):
    """计算卷积层的参数数量和计算量"""
    if isinstance(kernel_size, int):  # isinstance检查对象类型
        kh = kw = kernel_size
    else:
        kh, kw = kernel_size
    if isinstance(padding, int):
        ph = pw = padding
    else:
        ph, pw = padding
    if isinstance(stride, int):
        sh = sw = stride
    else:
        sh, sw = stride
    if isinstance(input_size, int):
        H_in = W_in = input_size
    else:
        H_in, W_in = input_size
    # 输出尺寸
    H_out = (H_in + 2*ph - kh) // sh + 1
    W_out = (W_in + 2*pw - kw) // sw + 1
    # 参数量
    params = out_channels * (in_channels * kh * kw)
    if bias:
        params += out_channels
    # FLOPs(乘加操作)
    flops = out_channels * H_out * W_out * (2 * in_channels * kh * kw)
    print(f"输出尺寸: ({out_channels}, {H_out}, {W_out})")
    print(f"参数量: {params:,}")
    print(f"FLOPs: {flops:,}")
    return {'output_size': (out_channels, H_out, W_out), 'params': params, 'flops': flops}
# 示例:ResNet 的第一个卷积层
calc_conv_params(in_channels=3, out_channels=64, kernel_size=7,
                 input_size=224, padding=3, stride=2)
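也可以手工代入公式,核对 ResNet 第一个卷积层的各项数值(纯 Python 示意):

```python
# ResNet conv1: k=7, s=2, p=3, 输入 224x224, 3→64 通道
H_out = (224 + 2 * 3 - 7) // 2 + 1            # (224+6-7)//2+1 = 112
params = 64 * (3 * 7 * 7 + 1)                 # 9,472(含偏置)
flops = 64 * H_out * H_out * (2 * 3 * 7 * 7)  # 236,027,904

print(H_out, params, flops)
```

三个数值应与 `calc_conv_params` 的输出一致。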
3. 填充与步长¶
3.1 填充(Padding)¶
- Valid 填充(无填充,\(p=0\)):输出尺寸缩小
- Same 填充(\(p = \lfloor k/2 \rfloor\),当 \(s=1\) 且 \(k\) 为奇数时):输出与输入尺寸相同
- Full 填充(\(p = k-1\)):输出尺寸增大
# 不同填充模式
conv_valid = nn.Conv2d(3, 16, kernel_size=3, padding=0) # 输出缩小 2
conv_same = nn.Conv2d(3, 16, kernel_size=3, padding=1) # 输出大小不变
conv_full = nn.Conv2d(3, 16, kernel_size=3, padding=2) # 输出增大 2
x = torch.randn(1, 3, 32, 32)
print(f"Valid: {conv_valid(x).shape}") # (1, 16, 30, 30)
print(f"Same: {conv_same(x).shape}") # (1, 16, 32, 32)
print(f"Full: {conv_full(x).shape}") # (1, 16, 34, 34)
# 零填充 vs 反射填充 vs 复制填充
x = torch.randn(1, 3, 32, 32)
x_pad_zero = F.pad(x, (1, 1, 1, 1), mode='constant', value=0)
x_pad_reflect = F.pad(x, (1, 1, 1, 1), mode='reflect')
x_pad_replicate = F.pad(x, (1, 1, 1, 1), mode='replicate')
3.2 步长(Stride)¶
步长控制卷积核每次移动的距离,\(s > 1\) 会产生下采样效果:
conv_s1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
conv_s2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(f"Stride=1: {conv_s1(x).shape}") # (1, 16, 32, 32)
print(f"Stride=2: {conv_s2(x).shape}") # (1, 16, 16, 16) — 尺寸减半
4. 池化层¶
4.1 最大池化(Max Pooling)¶
取局部区域的最大值,保留最显著的特征:

\[Y(i, j) = \max_{0 \le m,\, n < k} X(i \cdot s + m,\, j \cdot s + n)\]
4.2 平均池化(Average Pooling)¶
取局部区域的平均值:

\[Y(i, j) = \frac{1}{k^2} \sum_{0 \le m,\, n < k} X(i \cdot s + m,\, j \cdot s + n)\]
4.3 全局平均池化(Global Average Pooling)¶
对整个特征图取平均,将 \((C, H, W)\) 变为 \((C, 1, 1)\):

\[Y_c = \frac{1}{HW} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} X_c(i, j)\]
# ===== 手动实现池化 =====
def max_pool2d_manual(x, kernel_size=2, stride=2):
    """手动实现2D最大池化"""
    C, H, W = x.shape
    out_H = (H - kernel_size) // stride + 1
    out_W = (W - kernel_size) // stride + 1
    output = torch.zeros(C, out_H, out_W)
    for c in range(C):
        for i in range(out_H):
            for j in range(out_W):
                h_start = i * stride
                w_start = j * stride
                region = x[c, h_start:h_start+kernel_size, w_start:w_start+kernel_size]
                output[c, i, j] = region.max()
    return output
# ===== PyTorch 池化层 =====
# 最大池化
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# 平均池化
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# 全局平均池化
global_avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
# 全局最大池化
global_max_pool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
x = torch.randn(1, 64, 32, 32)
print(f"MaxPool: {max_pool(x).shape}") # (1, 64, 16, 16)
print(f"AvgPool: {avg_pool(x).shape}") # (1, 64, 16, 16)
print(f"GAP: {global_avg_pool(x).shape}") # (1, 64, 1, 1)
print(f"GMP: {global_max_pool(x).shape}") # (1, 64, 1, 1)
# 自适应池化(指定输出大小,自动计算 kernel_size 和 stride)
adaptive_pool = nn.AdaptiveAvgPool2d(output_size=(7, 7))
print(f"Adaptive: {adaptive_pool(x).shape}") # (1, 64, 7, 7)
5. 感受野计算¶
5.1 什么是感受野¶
感受野(Receptive Field)是指输出特征图上一个元素所"看到"的输入图像区域大小。感受野越大,网络捕获的上下文信息越丰富。
5.2 感受野计算公式¶
对于第 \(l\) 层:

\[RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i\]

或者等价地,从第一层开始递推(初始 \(RF_0 = 1\)),展开得:

\[RF_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i\]

其中 \(k_i\) 是第 \(i\) 层的卷积核大小,\(s_i\) 是第 \(i\) 层的步长。
def compute_receptive_field(layers):
    """
    计算感受野
    layers: [(kernel_size, stride), ...] 从浅到深
    """
    rf = 1  # 初始感受野
    stride_product = 1  # 累积步长
    for k, s in layers:
        rf = rf + (k - 1) * stride_product
        stride_product *= s
    return rf
# VGG-16 的前几层感受野
vgg_layers = [
    (3, 1), (3, 1), (2, 2),          # conv-conv-pool
    (3, 1), (3, 1), (2, 2),          # conv-conv-pool
    (3, 1), (3, 1), (3, 1), (2, 2),  # conv-conv-conv-pool
]
print(f"VGG 感受野: {compute_receptive_field(vgg_layers)}")
# ResNet-18 的前几层
resnet_layers = [
    (7, 2),          # conv1
    (3, 2),          # maxpool
    (3, 1), (3, 1),  # res block
    (3, 1), (3, 1),  # res block
]
print(f"ResNet 感受野: {compute_receptive_field(resnet_layers)}")
# 3个3x3卷积 vs 1个7x7卷积
print(f"3个3x3: RF = {compute_receptive_field([(3,1),(3,1),(3,1)])}") # 7
print(f"1个7x7: RF = {compute_receptive_field([(7,1)])}") # 7
# 参数量:3 × (3×3×C²) = 27C² vs 7×7×C² = 49C²,3个3x3更高效!
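上面注释中的参数量对比也可以直接算出来。假设输入输出通道数均为 \(C = 64\)(此处的 \(C=64\) 仅为示意取值):

```python
C = 64
params_three_3x3 = 3 * (3 * 3 * C * C)  # 3 个 3x3 卷积:27C² = 110,592
params_one_7x7 = 7 * 7 * C * C          # 1 个 7x7 卷积:49C² = 200,704
print(params_three_3x3, params_one_7x7)
```

两者感受野同为 7,但 3 个 \(3 \times 3\) 卷积的参数量约为单个 \(7 \times 7\) 卷积的 55%,且中间可以插入非线性激活。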
6. 转置卷积(反卷积)¶
6.1 原理¶
转置卷积(Transposed Convolution)用于上采样,将低分辨率特征图映射回高分辨率。它不是卷积的逆运算,而是卷积运算的转置。
对于普通卷积 \(Y = X * K\)(展开为矩阵运算 \(\mathbf{y} = \mathbf{C}\mathbf{x}\)),转置卷积是 \(\mathbf{x}' = \mathbf{C}^T\mathbf{y}\)。
输出尺寸:

\[H_{out} = (H_{in} - 1) \times s - 2p + k + \text{output\_padding}\]
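以下用纯 Python 代入该公式核对一组典型配置(\(k=4, s=2, p=1\),输入 \(8 \times 8\),与本节示例相同):

```python
H_in, k, s, p, output_padding = 8, 4, 2, 1, 0
H_out = (H_in - 1) * s - 2 * p + k + output_padding
print(H_out)  # 16,尺寸正好翻倍
```

这组配置(kernel_size 能被 stride 整除)也正是缓解棋盘格伪影的常用选择。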
# ===== 转置卷积 =====
# 上采样:从 (1, 64, 8, 8) → (1, 32, 16, 16)
trans_conv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 64, 8, 8)
print(f"转置卷积输出: {trans_conv(x).shape}") # (1, 32, 16, 16)
# 棋盘格伪影(Checkerboard Artifacts)的解决方案
# 方案1:使用 kernel_size 能被 stride 整除的配置
trans_conv_good = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
# 方案2:先上采样再卷积(推荐)
upsample_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
    nn.Conv2d(64, 32, kernel_size=3, padding=1)
)
# 方案3:PixelShuffle(亚像素卷积)
pixel_shuffle = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # 通道数需为 输出通道×r²
    nn.PixelShuffle(upscale_factor=2)              # (128, H, W) → (32, 2H, 2W)
)
x = torch.randn(1, 64, 8, 8)
print(f"PixelShuffle 输出: {pixel_shuffle(x).shape}") # (1, 32, 16, 16)
7. 深度可分离卷积¶
7.1 原理¶
深度可分离卷积(Depthwise Separable Convolution)将标准卷积分解为两步:
- 深度卷积(Depthwise Convolution): 每个通道独立地进行空间卷积
- 逐点卷积(Pointwise Convolution): 使用 \(1 \times 1\) 卷积混合通道信息
7.2 参数量对比¶
标准卷积参数量: \(C_{out} \times C_{in} \times k \times k\)
深度可分离卷积参数量: \(C_{in} \times k \times k + C_{out} \times C_{in}\)
压缩比: \(\frac{1}{C_{out}} + \frac{1}{k^2}\)
当 \(C_{out}=256\), \(k=3\) 时,参数量仅为标准卷积的 \(\frac{1}{256} + \frac{1}{9} \approx 11.5\%\)
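该压缩比可以直接代入数字验证(\(C_{in}=64\) 为示意取值,不影响比值):

```python
C_in, C_out, k = 64, 256, 3
standard = C_out * C_in * k * k       # 标准卷积:147,456
dw_sep = C_in * k * k + C_out * C_in  # 深度卷积 + 逐点卷积:16,960
ratio = dw_sep / standard
print(f"{ratio:.1%}")  # 与 1/C_out + 1/k² 一致
```

注意压缩比与 \(C_{in}\) 无关:分子分母同除以 \(C_{in}\) 即得 \(\frac{1}{C_{out}} + \frac{1}{k^2}\)。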
# ===== 手动实现深度可分离卷积 =====
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):  # __init__构造方法,创建对象时自动调用
        super().__init__()  # super()调用父类方法
        # 深度卷积:groups=in_channels,每个通道独立卷积
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels
        )
        # 逐点卷积:1x1 卷积混合通道
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = F.relu(self.bn1(self.depthwise(x)))
        x = F.relu(self.bn2(self.pointwise(x)))
        return x
# 参数量对比
standard_conv = nn.Conv2d(64, 128, 3, padding=1)
dw_sep_conv = DepthwiseSeparableConv(64, 128, 3, padding=1)
standard_params = sum(p.numel() for p in standard_conv.parameters())
dw_sep_params = sum(p.numel() for p in dw_sep_conv.parameters())
print(f"标准卷积参数量: {standard_params:,}")
print(f"深度可分离卷积参数量: {dw_sep_params:,}")
print(f"压缩比: {dw_sep_params/standard_params:.2%}")
8. 空洞卷积¶
8.1 原理¶
空洞卷积(Dilated/Atrous Convolution)在卷积核元素之间插入"空洞"(零),以增大感受野而不增加参数量。
膨胀率(dilation rate)\(d\) 控制空洞大小。有效卷积核大小为:

\[k_{eff} = k + (k - 1)(d - 1)\]

当 \(d=1\) 时退化为普通卷积。
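有效核大小的公式可以用几行纯 Python 验证(示意):

```python
def effective_kernel(k, d):
    """空洞卷积的有效核大小:k + (k-1)(d-1)"""
    return k + (k - 1) * (d - 1)

print([effective_kernel(3, d) for d in (1, 2, 4)])  # [3, 5, 9]
```

对 \(3 \times 3\) 核,\(d=2\) 时等效 \(5 \times 5\),\(d=4\) 时等效 \(9 \times 9\),参数量始终只有 9 个权重。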
# ===== 空洞卷积 =====
# 普通卷积 vs 空洞卷积
conv_normal = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
conv_dilated4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
x = torch.randn(1, 64, 32, 32)
print(f"dilation=1: {conv_normal(x).shape}") # 感受野 3x3
print(f"dilation=2: {conv_dilated2(x).shape}") # 感受野 5x5 (等效)
print(f"dilation=4: {conv_dilated4(x).shape}") # 感受野 9x9 (等效)
# 多尺度空洞卷积(Atrous Spatial Pyramid Pooling - ASPP,用于语义分割)
class ASPP(nn.Module):
    def __init__(self, in_channels, out_channels, rates=[6, 12, 18]):
        super().__init__()
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )
        self.dilated_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r),
                nn.BatchNorm2d(out_channels),
                nn.ReLU()
            ) for r in rates
        ])
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (2 + len(rates)), out_channels, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )

    def forward(self, x):
        size = x.shape[2:]  # 切片操作取子序列
        features = [self.conv1x1(x)]
        for conv in self.dilated_convs:
            features.append(conv(x))
        global_feat = self.global_pool(x)
        global_feat = F.interpolate(global_feat, size=size, mode='bilinear', align_corners=True)
        features.append(global_feat)
        return self.project(torch.cat(features, dim=1))
9. 完整CNN结构示例¶
import torch
import torch.nn as nn
import torch.nn.functional as F
class CompleteCNN(nn.Module):
    """包含所有核心组件的CNN示例"""
    def __init__(self, num_classes=10):
        super().__init__()
        # 特征提取部分
        self.features = nn.Sequential(
            # Block 1: 标准卷积
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 2: 深度可分离卷积
            DepthwiseSeparableConv(32, 64, kernel_size=3, padding=1),
            DepthwiseSeparableConv(64, 64, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Block 3: 空洞卷积
            nn.Conv2d(64, 128, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # 全局平均池化
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        # 分类头
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )
        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)         # 特征提取
        x = self.global_avg_pool(x)  # (N, 128, 1, 1)
        x = x.view(x.size(0), -1)    # (N, 128)
        x = self.classifier(x)       # (N, num_classes)
        return x
# 测试
model = CompleteCNN(num_classes=10)
x = torch.randn(4, 3, 32, 32)
output = model(x)
print(f"输出 shape: {output.shape}")
# 统计参数量
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"总参数量: {total_params:,}")
print(f"可训练参数量: {trainable_params:,}")
10. 练习与自我检查¶
练习题¶
1. 手动卷积: 给定一个 \(5 \times 5\) 矩阵和一个 \(3 \times 3\) 核,手动计算卷积结果,然后用 PyTorch 验证。
2. 参数计算: 画出一个 CNN 架构(3层卷积 + 2层全连接),计算每层的参数量和输出尺寸。
3. 感受野: 设计一个网络,使感受野覆盖整个 \(224 \times 224\) 的输入图像。
4. 深度可分离卷积: 实现一个 MobileNet 风格的网络块,对比标准卷积的参数量和推理速度。
5. 转置卷积: 实现一个简单的编码器-解码器结构(用卷积下采样,转置卷积上采样),在 MNIST 上训练自编码器。
6. CIFAR-10 分类: 使用本章学到的组件构建 CNN,在 CIFAR-10 上达到 90%+ 测试准确率。
自我检查清单¶
- 理解卷积运算的数学定义(1D/2D/3D)
- 能手动计算卷积层的输出尺寸和参数量
- 理解 Max/Average/Global Average Pooling 的区别
- 能计算任意网络的感受野
- 理解转置卷积的上采样原理和棋盘格伪影
- 理解深度可分离卷积及其效率优势
- 了解空洞卷积如何扩大感受野
- 能用 PyTorch 搭建完整的 CNN 结构
下一篇: 02-经典CNN架构 — 学习影响深远的CNN架构设计