# 05 - Conditional Generation and Control

⚠️ Timeliness note: this chapter mentions cutting-edge models, prices, leaderboards, and similar information that can change quickly across versions; always defer to the original papers, official release pages, and API documentation.

Study time: 3.5 hours | Importance: ⭐⭐⭐⭐⭐ | The key to building controllable, practical diffusion models

## 🎯 Learning Objectives

After completing this chapter, you will be able to:

- Understand the principles of conditional diffusion models
- Master classifier guidance and classifier-free guidance
- Learn how text-to-image generation is implemented
- Implement image editing and inpainting
- Master a range of control methods
## 1. Conditional Generation Overview

### 1.1 What Is Conditional Generation?

Conditional generation means generating specific content according to a given condition, such as a class label, a text description, or a reference image.

An analogy:

- Unconditional generation: "paint a picture" → a random image
- Conditional generation: "paint a cat" → an image of a cat

### 1.2 Types of Conditions

| Condition type | Example | Application |
|---|---|---|
| Class label | The 10 CIFAR-10 classes | Class-controlled generation |
| Text description | "a cute cat" | Text-to-image |
| Reference image | An input image | Inpainting, image editing |
| Spatial condition | Edge map, depth map | Structural control |
| Style condition | An artistic style | Style transfer |

## 2. Class-Conditional Generation

### 2.1 Basic Idea

The simplest way to condition a diffusion model on a class is to embed the class label into the model.

Approach:

1. Convert the class label into an embedding vector
2. Add that embedding to the timestep embedding
3. Alternatively, inject the condition information into every layer
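The steps above can be sketched in a few lines; the dimensions here are illustrative stand-ins, and the random `t_emb` stands in for the output of the sinusoidal encoder + MLP defined below:

```python
import torch
import torch.nn as nn

emb_dim = 128
num_classes = 10

# Step 1: an embedding table turns integer labels into vectors
class_embedding = nn.Embedding(num_classes, emb_dim)

# Stand-in for the timestep embedding (sinusoidal encoding + MLP in practice)
t_emb = torch.randn(4, emb_dim)

# Step 2: add the label embedding to the timestep embedding
labels = torch.tensor([0, 1, 2, 3])
cond_emb = t_emb + class_embedding(labels)
print(cond_emb.shape)  # torch.Size([4, 128])
```

The combined vector `cond_emb` is then fed to every residual block, exactly as the full model below does.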
### 2.2 Implementing a Conditional UNet

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionEmbedding(nn.Module):
    """Sinusoidal position embedding for the diffusion timestep."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        """
        Args:
            x: [batch_size]
        Returns:
            embeddings: [batch_size, dim]
        """
        device = x.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = x[:, None] * emb[None, :]
        # Concatenate sine and cosine components along the last dimension
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
        return emb


class ResidualBlock(nn.Module):
    """Residual block that injects a timestep/condition embedding."""

    def __init__(self, in_channels, out_channels, emb_dim):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        # Project the embedding from emb_dim to out_channels
        self.emb_proj = nn.Linear(emb_dim, out_channels)
        # Adjust channels on the skip path with a 1x1 conv when they differ
        if in_channels != out_channels:
            self.skip_conv = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.skip_conv = nn.Identity()
        self.activation = nn.SiLU()

    def forward(self, x, emb):
        """
        Args:
            x: [batch_size, in_channels, h, w]
            emb: [batch_size, emb_dim]
        Returns:
            [batch_size, out_channels, h, w]
        """
        h = self.norm1(x)
        h = self.activation(h)
        h = self.conv1(h)
        # Add the projected embedding, broadcast over the spatial dimensions
        h = h + self.emb_proj(emb)[:, :, None, None]
        h = self.norm2(h)
        h = self.activation(h)
        h = self.conv2(h)
        # Residual connection
        return h + self.skip_conv(x)


class ClassConditionedUNet(nn.Module):
    """Class-conditional UNet."""

    def __init__(self, in_channels=3, out_channels=3, model_dim=128, num_classes=10):
        super().__init__()
        # Timestep embedding
        self.time_embedding = SinusoidalPositionEmbedding(model_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim),
        )
        # Class embedding
        self.class_embedding = nn.Embedding(num_classes, model_dim)
        self.class_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim),
        )
        # Input convolution
        self.conv_in = nn.Conv2d(in_channels, model_dim, 3, padding=1)
        # Downsampling path
        self.down_blocks = nn.ModuleList([
            self._make_down_block(model_dim, model_dim * 2, model_dim),
            self._make_down_block(model_dim * 2, model_dim * 4, model_dim),
            self._make_down_block(model_dim * 4, model_dim * 4, model_dim),
        ])
        # Middle block
        self.mid_block = self._make_mid_block(model_dim * 4, model_dim)
        # Upsampling path
        self.up_blocks = nn.ModuleList([
            self._make_up_block(model_dim * 4, model_dim * 4, model_dim),
            self._make_up_block(model_dim * 4, model_dim * 2, model_dim),
            self._make_up_block(model_dim * 2, model_dim, model_dim),
        ])
        # Output convolution
        self.conv_out = nn.Conv2d(model_dim, out_channels, 3, padding=1)

    def _make_down_block(self, in_channels, out_channels, emb_dim):
        """Two residual blocks followed by a stride-2 downsampling conv."""
        return nn.ModuleList([
            ResidualBlock(in_channels, out_channels, emb_dim),
            ResidualBlock(out_channels, out_channels, emb_dim),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1),
        ])

    def _make_up_block(self, in_channels, out_channels, emb_dim):
        """A transposed-conv upsample followed by two residual blocks."""
        return nn.ModuleList([
            nn.ConvTranspose2d(in_channels, in_channels, 4, stride=2, padding=1),
            ResidualBlock(in_channels, out_channels, emb_dim),
            ResidualBlock(out_channels, out_channels, emb_dim),
        ])

    def _make_mid_block(self, channels, emb_dim):
        """Two residual blocks at the lowest resolution."""
        return nn.ModuleList([
            ResidualBlock(channels, channels, emb_dim),
            ResidualBlock(channels, channels, emb_dim),
        ])

    def forward(self, x, t, class_labels=None):
        """
        Args:
            x: [batch_size, in_channels, height, width]
            t: [batch_size] timesteps
            class_labels: [batch_size] class labels (optional)
        Returns:
            [batch_size, out_channels, height, width]
        """
        # Timestep embedding
        t_emb = self.time_embedding(t)
        t_emb = self.time_mlp(t_emb)
        # Class embedding
        if class_labels is not None:
            c_emb = self.class_embedding(class_labels)
            c_emb = self.class_mlp(c_emb)
            # Combine timestep and class embeddings
            emb = t_emb + c_emb
        else:
            emb = t_emb
        # Input convolution
        h = self.conv_in(x)
        # Downsampling path. Save skip features *before* each downsampling conv
        # so their resolution and channel count match the upsampled features later.
        skips = []
        for down_block in self.down_blocks:
            for layer in down_block:
                if isinstance(layer, ResidualBlock):
                    h = layer(h, emb)
                else:  # the stride-2 downsampling conv
                    skips.append(h)
                    h = layer(h)
        # Middle block
        for layer in self.mid_block:
            h = layer(h, emb)
        # Upsampling path: upsample, fuse the matching skip, then refine
        for i, up_block in enumerate(self.up_blocks):
            for layer in up_block:
                if isinstance(layer, nn.ConvTranspose2d):
                    h = layer(h)
                    h = h + skips[-(i + 1)]
                else:
                    h = layer(h, emb)
        # Output
        h = self.conv_out(h)
        return h


# Usage example
model = ClassConditionedUNet(
    in_channels=3,
    out_channels=3,
    model_dim=128,
    num_classes=10,
)

# Quick sanity check
x = torch.randn(4, 3, 32, 32)
t = torch.randint(0, 1000, (4,))
class_labels = torch.randint(0, 10, (4,))
output = model(x, t, class_labels)
print(f"Output shape: {output.shape}")  # torch.Size([4, 3, 32, 32])
```
### 2.3 Conditional Training

```python
import torch.optim as optim


def train_conditional_diffusion(model, train_loader, val_loader, num_epochs,
                                num_classes=10, device='cuda'):
    """
    Train a conditional diffusion model.

    Args:
        model: conditional diffusion model
        train_loader: training data loader
        val_loader: validation data loader
        num_epochs: number of training epochs
        num_classes: number of classes
        device: device to train on
    """
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)

    # Noise schedule
    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod = alphas_cumprod.to(device)

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for x_0, labels in train_loader:
            x_0 = x_0.to(device)
            labels = labels.to(device)

            # Sample random timesteps
            batch_size = x_0.shape[0]
            t = torch.randint(0, T, (batch_size,), device=device)

            # Sample noise
            noise = torch.randn_like(x_0)

            # Forward-diffuse x_0 to step t
            sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
            x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise

            # Predict the noise, conditioned on the labels
            predicted_noise = model(x_t, t, labels)

            # MSE between predicted and true noise
            loss = nn.functional.mse_loss(predicted_noise, noise)

            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
        train_loss /= len(train_loader)

        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():  # no gradients needed; saves memory
            for x_0, labels in val_loader:
                x_0 = x_0.to(device)
                labels = labels.to(device)
                batch_size = x_0.shape[0]
                t = torch.randint(0, T, (batch_size,), device=device)
                noise = torch.randn_like(x_0)
                sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)
                sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
                x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise
                predicted_noise = model(x_t, t, labels)
                loss = nn.functional.mse_loss(predicted_noise, noise)
                val_loss += loss.item()
        val_loss /= len(val_loader)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
```
## 3. Classifier-Free Guidance

### 3.1 Principle

The core idea of classifier-free guidance (CFG) is to train the conditional and unconditional models jointly in a single network (by randomly dropping the condition during training) and then steer sampling by interpolating between the two predictions.

Formula:

$$\epsilon_\theta^{\text{CFG}}(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot \left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\right)$$

where:

- \(\epsilon_\theta(x_t, t, c)\): the conditional prediction
- \(\epsilon_\theta(x_t, t, \emptyset)\): the unconditional prediction
- \(w\): the guidance scale; \(w = 1\) recovers the plain conditional prediction, and \(w > 1\) extrapolates beyond it to strengthen the condition
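The formula is a one-liner in code. This minimal sketch (the `cfg_combine` helper name is mine) shows how the guidance scale \(w\) interpolates and extrapolates between the two predictions:

```python
import torch


def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)


# Toy predictions just to make the arithmetic visible
eps_uncond = torch.zeros(2, 3)
eps_cond = torch.ones(2, 3)

cfg_combine(eps_uncond, eps_cond, 0.0)           # w = 0: purely unconditional
cfg_combine(eps_uncond, eps_cond, 1.0)           # w = 1: purely conditional
guided = cfg_combine(eps_uncond, eps_cond, 7.5)  # w > 1: pushes past the conditional prediction
print(guided[0, 0].item())  # 7.5
```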
### 3.2 Training a CFG Model

```python
def train_cfg_model(model, train_loader, val_loader, num_epochs,
                    num_classes=10, dropout_prob=0.1, device='cuda'):
    """
    Train a CFG model.

    Note: the model must reserve a dedicated "null" label for the
    unconditional case, e.g. build it with nn.Embedding(num_classes + 1, ...)
    and use index num_classes as the null label. Reusing a real class
    (such as 0) as the null label would conflate "no condition" with
    that class.

    Args:
        model: conditional diffusion model
        train_loader: training data loader
        val_loader: validation data loader
        num_epochs: number of training epochs
        num_classes: number of classes
        dropout_prob: probability of dropping the condition
        device: device to train on
    """
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)

    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod = alphas_cumprod.to(device)

    null_label = num_classes  # dedicated "no condition" label index

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0
        for x_0, labels in train_loader:
            x_0 = x_0.to(device)
            labels = labels.to(device)
            batch_size = x_0.shape[0]

            # Randomly drop the condition for a fraction of the batch
            use_condition = torch.rand(batch_size, device=device) > dropout_prob
            condition_labels = labels.clone()
            condition_labels[~use_condition] = null_label

            # Sample random timesteps
            t = torch.randint(0, T, (batch_size,), device=device)

            # Sample noise
            noise = torch.randn_like(x_0)

            # Forward-diffuse x_0 to step t
            sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
            x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise

            # Predict the noise
            predicted_noise = model(x_t, t, condition_labels)

            # Loss
            loss = nn.functional.mse_loss(predicted_noise, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
        train_loss /= len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}")
```
### 3.3 CFG Sampling

```python
def cfg_sample(model, x_T, class_labels, alphas, betas, alphas_cumprod,
               guidance_scale=7.5, null_label=10, device='cuda'):
    """
    CFG sampling.

    Args:
        model: trained CFG model
        x_T: initial noise
        class_labels: class labels
        alphas, betas, alphas_cumprod: noise schedule
        guidance_scale: guidance strength w
        null_label: label index reserved for "no condition" (must match training)
        device: device
    Returns:
        generated images
    """
    model.eval()
    x_t = x_T.to(device)
    class_labels = class_labels.to(device)
    T = len(alphas)
    alphas_cumprod_prev = torch.cat(
        [torch.tensor([1.0], device=alphas_cumprod.device), alphas_cumprod[:-1]]
    )

    with torch.no_grad():
        for t in reversed(range(T)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]
            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # Conditional prediction
            noise_cond = model(x_t, t_tensor, class_labels)
            # Unconditional prediction (null label)
            noise_uncond = model(x_t, t_tensor, torch.full_like(class_labels, null_label))
            # Guided combination
            predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

            # Reverse-step mean
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)
            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
            else:
                x_t = mean
    return x_t


# Usage example: generate one image per class
# (assumes `model`, `alphas`, `betas`, `alphas_cumprod` from the training code
# above; the model must have been built with num_classes + 1 = 11 embedding
# entries, as described in Section 3.2)
class_labels = torch.tensor([0, 1, 2, 3])
x_T = torch.randn(4, 3, 32, 32).to('cuda')
samples = cfg_sample(model, x_T, class_labels, alphas, betas, alphas_cumprod,
                     guidance_scale=7.5, device='cuda')
print(f"Done, shape: {samples.shape}")
```
## 4. Text-to-Image Generation

### 4.1 Text Encoding

```python
from transformers import CLIPTextModel, CLIPTokenizer


class TextEncoder(nn.Module):
    """Text encoder built on CLIP.

    The ViT-L/14 text encoder used by Stable Diffusion has hidden size 768,
    matching the 768-dim text embeddings assumed by the UNet below
    ("openai/clip-vit-base-patch32" would give 512 instead).
    """

    def __init__(self, model_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)
        self.text_model = CLIPTextModel.from_pretrained(model_name)

    def forward(self, text_prompts):
        """
        Encode a list of text prompts.

        Args:
            text_prompts: list of strings
        Returns:
            text_embeddings: [batch_size, seq_len, embedding_dim]
        """
        # Tokenize
        inputs = self.tokenizer(
            text_prompts,
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
        # Move token ids to the same device as the text model
        inputs = {k: v.to(self.text_model.device) for k, v in inputs.items()}
        # Encode
        outputs = self.text_model(**inputs)
        text_embeddings = outputs.last_hidden_state
        return text_embeddings


# Usage example
text_encoder = TextEncoder()
prompts = ["a cute cat", "a running dog", "a beautiful flower"]
embeddings = text_encoder(prompts)
print(f"Text embedding shape: {embeddings.shape}")
```
### 4.2 A Text-Conditional UNet

```python
class TextConditionedUNet(nn.Module):
    """Text-conditional UNet (simplified).

    For brevity, this sketch pools the text embedding into a single global
    conditioning vector. Full implementations (e.g. Stable Diffusion) instead
    apply cross-attention between image features and the per-token text
    embeddings inside every block.
    """

    def __init__(self, in_channels=3, out_channels=3, model_dim=128,
                 text_embedding_dim=768):
        super().__init__()
        # Timestep embedding
        self.time_embedding = SinusoidalPositionEmbedding(model_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim),
        )
        # Text embedding projection
        self.text_proj = nn.Linear(text_embedding_dim, model_dim * 4)
        self.text_mlp = nn.Sequential(
            nn.Linear(model_dim * 4, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim),
        )
        # Cross-attention over text tokens (unused in this simplified forward;
        # kept to show where per-token conditioning would plug in)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=model_dim,
            num_heads=4,
            batch_first=True,
        )
        # UNet backbone (simplified)
        self.conv_in = nn.Conv2d(in_channels, model_dim, 3, padding=1)
        self.down_blocks = nn.ModuleList([
            self._make_down_block(model_dim, model_dim * 2),
            self._make_down_block(model_dim * 2, model_dim * 4),
        ])
        self.up_blocks = nn.ModuleList([
            self._make_up_block(model_dim * 4, model_dim * 2),
            self._make_up_block(model_dim * 2, model_dim),
        ])
        self.conv_out = nn.Conv2d(model_dim, out_channels, 3, padding=1)

    def _make_down_block(self, in_channels, out_channels):
        """Stride-2 downsampling conv followed by a refinement conv."""
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
        )

    def _make_up_block(self, in_channels, out_channels):
        """Transposed-conv upsample followed by a refinement conv."""
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, in_channels, 4, stride=2, padding=1),
            nn.GroupNorm(8, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
        )

    def forward(self, x, t, text_embeddings):
        """
        Args:
            x: [batch_size, in_channels, h, w]
            t: [batch_size] timesteps
            text_embeddings: [batch_size, seq_len, text_dim]
        Returns:
            [batch_size, out_channels, h, w]
        """
        # Timestep embedding
        t_emb = self.time_embedding(t)
        t_emb = self.time_mlp(t_emb)
        # Text embedding
        text_emb = self.text_proj(text_embeddings)
        text_emb = self.text_mlp(text_emb)
        # Combine the embeddings (mean-pool over text tokens)
        emb = t_emb + text_emb.mean(dim=1)
        # Input convolution, with the condition injected once at the input
        # (full UNets inject it at every block)
        h = self.conv_in(x) + emb[:, :, None, None]
        # Downsampling path
        skips = []
        for down_block in self.down_blocks:
            h = down_block(h)
            skips.append(h)
        # Upsampling path: fuse the matching skip *before* each up block so
        # resolutions and channel counts line up
        for up_block in self.up_blocks:
            h = h + skips.pop()
            h = up_block(h)
        # Output
        h = self.conv_out(h)
        return h
```
## 5. Image Editing and Inpainting

### 5.1 Inpainting

```python
def inpaint(model, original_image, mask, alphas, betas, alphas_cumprod,
            num_steps=1000, device='cuda'):
    """
    Inpainting with an unconditional diffusion model.

    Args:
        model: diffusion model
        original_image: original image [1, C, H, W]
        mask: mask [1, 1, H, W]; 1 marks the region to repaint
        alphas, betas, alphas_cumprod: noise schedule (len == num_steps)
        num_steps: number of sampling steps
        device: device
    Returns:
        the inpainted image
    """
    model.eval()
    original_image = original_image.to(device)
    mask = mask.to(device)

    # Start from pure noise; the known region is overwritten at every step below
    x_t = torch.randn_like(original_image)
    alphas_cumprod_prev = torch.cat(
        [torch.tensor([1.0], device=alphas_cumprod.device), alphas_cumprod[:-1]]
    )

    with torch.no_grad():
        for t in reversed(range(num_steps)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]
            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # Predict the noise
            predicted_noise = model(x_t, t_tensor)

            # Reverse-step mean
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)
            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            alpha_t_bar_prev = alphas_cumprod_prev[t]
            if t > 0:
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
                # Known region: forward-diffuse the original image to the same
                # noise level, so both regions stay statistically consistent
                # (RePaint-style)
                x_known = (torch.sqrt(alpha_t_bar_prev) * original_image
                           + torch.sqrt(1 - alpha_t_bar_prev) * torch.randn_like(x_t))
            else:
                x_t = mean
                x_known = original_image

            # Keep the original content in the unmasked region
            x_t = x_t * mask + x_known * (1 - mask)
    return x_t


# Usage example (assumes `model` and the schedule tensors from the earlier
# sections, plus an `original_image` tensor of shape [1, 3, 32, 32])
# Mask out the central region to be repainted
mask = torch.zeros(1, 1, 32, 32)
mask[:, :, 10:22, 10:22] = 1
repaired_image = inpaint(model, original_image, mask, alphas, betas, alphas_cumprod)
```
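The blending step at the heart of the loop, `x_t * mask + x_known * (1 - mask)`, simply keeps known pixels where the mask is 0 and generated pixels where it is 1. A toy check with made-up tensors:

```python
import torch

x_generated = torch.zeros(1, 1, 4, 4)  # current (denoised) sample
x_known = torch.ones(1, 1, 4, 4)       # original image content
mask = torch.zeros(1, 1, 4, 4)
mask[..., 1:3, 1:3] = 1                # 1 = region to repaint

blended = x_generated * mask + x_known * (1 - mask)
print(blended[0, 0])
# corner pixels come from x_known (1.0), center pixels from x_generated (0.0)
```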
### 5.2 Image Editing

```python
def edit_image(model, original_image, target_text, text_encoder, alphas, betas,
               alphas_cumprod, guidance_scale=7.5, strength=0.6,
               num_steps=1000, device='cuda'):
    """
    Text-guided image editing (SDEdit-style).

    Instead of starting from pure noise, the original image is forward-diffused
    to an intermediate step t_start = strength * num_steps and then denoised
    from there under the text condition. The lower the strength, the more of
    the original image is preserved.

    Args:
        model: text-conditional diffusion model
        original_image: original image
        target_text: target text description
        text_encoder: text encoder
        alphas, betas, alphas_cumprod: noise schedule
        guidance_scale: guidance strength
        strength: editing strength in (0, 1]
        num_steps: total diffusion steps T
        device: device
    Returns:
        the edited image
    """
    model.eval()
    original_image = original_image.to(device)

    # Encode the text
    text_embeddings = text_encoder([target_text]).to(device)

    # Forward-diffuse the original image to t_start (rather than pure noise)
    t_start = int(strength * num_steps)
    alpha_bar_start = alphas_cumprod[t_start - 1]
    x_t = (torch.sqrt(alpha_bar_start) * original_image
           + torch.sqrt(1 - alpha_bar_start) * torch.randn_like(original_image))

    alphas_cumprod_prev = torch.cat(
        [torch.tensor([1.0], device=alphas_cumprod.device), alphas_cumprod[:-1]]
    )

    with torch.no_grad():
        for t in reversed(range(t_start)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]
            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # Conditional prediction
            noise_cond = model(x_t, t_tensor, text_embeddings)
            # Unconditional prediction (a zeroed text embedding as a simple
            # null condition; production systems encode the empty prompt "")
            noise_uncond = model(x_t, t_tensor, torch.zeros_like(text_embeddings))
            # Guided combination
            predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

            # Reverse-step mean
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)
            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
            else:
                x_t = mean
    return x_t
```
## 6. A Systematic Look at ControlNet

### 6.1 The ControlNet Architecture

ControlNet (Zhang & Agrawala, 2023) adds spatial conditioning to a pretrained diffusion model. Its core innovations are the zero convolution and the freeze-and-copy training strategy.

Core idea: create a trainable copy of the original UNet encoder, identical in structure, and connect it to the original network through zero convolutions.

ControlNet architecture:

```
          Condition input (edge map / depth map / skeleton)
                          │
                ┌─────────▼─────────┐
                │ Condition encoder │
                │ (CNN / Hint Block)│
                └─────────┬─────────┘
                          │
┌─────────────────────────┼───────────────────────────────┐
│   Frozen original UNet  │     Trainable ControlNet      │
│  ┌──────────────┐ Zero  │   ┌──────────────┐            │
│  │   Encoder    │◄─Conv─┼───│   Encoder    │            │
│  │   Block 1    │       │   │   Block 1    │            │
│  └──────┬───────┘       │   └──────┬───────┘            │
│         │               │          │                    │
│  ┌──────▼───────┐ Zero  │   ┌──────▼───────┐            │
│  │   Encoder    │◄─Conv─┼───│   Encoder    │            │
│  │   Block 2    │       │   │   Block 2    │            │
│  └──────┬───────┘       │   └──────┬───────┘            │
│         │               │          │                    │
│  ┌──────▼───────┐ Zero  │   ┌──────▼───────┐            │
│  │   Middle     │◄─Conv─┼───│   Middle     │            │
│  │   Block      │       │   │   Block      │            │
│  └──────┬───────┘       │   └──────────────┘            │
│         │               │                               │
│  ┌──────▼───────┐       │  Training: only the ControlNet│
│  │   Decoder    │       │  side is updated              │
│  │  (unchanged) │       │  Inference: the two paths     │
│  └──────────────┘       │  are summed                   │
└─────────────────────────┴───────────────────────────────┘
```

Key design elements:

| Design element | Description |
|---|---|
| Zero convolution | A 1×1 conv with weights and bias initialized to 0; early in training the ControlNet branch outputs 0 and does not perturb the original model |
| Freezing | The original UNet is fully frozen; only the ControlNet copy and the zero convolutions are trained |
| Structural copy | ControlNet copies the UNet encoder, inheriting the pretrained semantic understanding |
| Sudden convergence | Thanks to the zero convolutions, the model often learns conditional control abruptly after relatively few training steps |
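A zero convolution is just a 1×1 convolution whose weight and bias start at zero. A minimal sketch (the `ZeroConv2d` class name is mine) showing that at initialization the ControlNet branch contributes nothing to the frozen path:

```python
import torch
import torch.nn as nn


class ZeroConv2d(nn.Conv2d):
    """1x1 convolution initialized to all zeros (weights and bias)."""

    def __init__(self, in_channels, out_channels):
        super().__init__(in_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)


zero_conv = ZeroConv2d(64, 64)
frozen_feature = torch.randn(1, 64, 16, 16)   # output of a frozen UNet block
control_feature = torch.randn(1, 64, 16, 16)  # output of the ControlNet copy

# At step 0 the zero conv outputs all zeros, so the frozen path is unchanged;
# gradients still flow into the conv, so it can learn to pass information later
fused = frozen_feature + zero_conv(control_feature)
print(torch.equal(fused, frozen_feature))  # True
```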
### 6.2 ControlNet++: Multi-Condition Control and the Union Model

ControlNet++ and ControlNet Union extend the original with multi-condition control:

- **Multi-ControlNet**: run several ControlNets in parallel and take a weighted sum of their outputs:

```
output = w₁ × ControlNet(edge map) + w₂ × ControlNet(depth map) + w₃ × ControlNet(pose)
```

- **ControlNet Union**: a single model supports multiple condition types, switched via a condition-type ID
  - Advantage: one loaded model handles many kinds of conditions
  - Supported conditions: Canny, Depth, Pose, Scribble, Segmentation, and more
- **ControlNet++** (reward-based): introduces pixel-level reward guidance, markedly improving condition consistency

### 6.3 ControlNet Code Example

```python
import cv2
import numpy as np
import torch
from diffusers import (
    StableDiffusionXLControlNetPipeline,
    ControlNetModel,
)
from diffusers.utils import load_image
from PIL import Image

# === 1. Load the ControlNet model ===
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Load SDXL Base + ControlNet
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
)
# Saves VRAM by moving submodules to the GPU on demand
# (don't also call pipe.to("cuda") when using offload)
pipe.enable_model_cpu_offload()

# === 2. Prepare the control condition (Canny edge map) ===
original_image = load_image("https://example.com/input.jpg")
image_np = np.array(original_image)
canny_image = cv2.Canny(image_np, 100, 200)
canny_image = Image.fromarray(canny_image).convert("RGB")

# === 3. Generate ===
result = pipe(
    prompt="a beautiful oil painting of a cityscape, masterpiece, detailed",
    negative_prompt="blurry, low quality, deformed",
    image=canny_image,  # control condition
    controlnet_conditioning_scale=0.7,  # control strength (0-1)
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("controlnet_result.png")

# === 4. Multi-ControlNet (multiple conditions) ===
controlnet_canny = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
controlnet_depth = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)

pipe_multi = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet_canny, controlnet_depth],  # multiple ControlNets
    torch_dtype=torch.float16,
)
pipe_multi.to("cuda")

result_multi = pipe_multi(
    prompt="futuristic architecture, photorealistic",
    image=[canny_image, depth_image],  # matching condition images (depth_image prepared separately)
    controlnet_conditioning_scale=[0.7, 0.5],  # per-ControlNet strength
    num_inference_steps=30,
).images[0]
```
## 7. IP-Adapter in Depth

### 7.1 The IP-Adapter Architecture (Decoupled Cross-Attention)

IP-Adapter (Image Prompt Adapter, Ye et al., 2023) injects image features into a diffusion model through a decoupled cross-attention mechanism, enabling generation controlled by an image used as a prompt (an "image prompt").

IP-Adapter architecture:

```
┌────────────────┐        ┌────────────────┐
│  Text prompt   │        │ Reference image│
│ "a cat in..."  │        │ (style/content)│
└───────┬────────┘        └───────┬────────┘
        │                         │
┌───────▼────────┐        ┌───────▼────────┐
│   CLIP Text    │        │   CLIP Image   │
│    Encoder     │        │    Encoder     │
└───────┬────────┘        └───────┬────────┘
        │                         │
        │ text features           │ image features
        │                         │
┌───────▼────────┐        ┌───────▼────────┐
│    Original    │        │      New       │
│   Cross-Attn   │        │   Cross-Attn   │
│    (frozen)    │        │  (trainable)   │
│ Q·K_text·V_text│        │ Q·K_img·V_img  │
└───────┬────────┘        └───────┬────────┘
        │                         │
        └──────────── + ──────────┘
                      │
             ┌────────▼───────┐
             │  UNet feature  │
             │ (fused output) │
             └────────────────┘
```

Core formula:

```
Attention_new = Softmax(Q·K_text/√d)·V_text + λ·Softmax(Q·K_img/√d)·V_img
```

where λ controls the influence of the image prompt.

Core design:

- Decoupled cross-attention: text and image each have their own K and V projections, while Q is shared
- Only the IP-Adapter layers are trained: the UNet and the CLIP encoders stay frozen, keeping the trainable parameter count tiny (~22M)
- Composable: compatible with ControlNet and LoRA, and can be used together with them
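The decoupled attention can be sketched directly from the formula: shared queries, separate key/value sets for text and image tokens. The tensor shapes here are illustrative, and the K/V projection layers are omitted for brevity:

```python
import torch
import torch.nn.functional as F


def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, lam=1.0):
    """Attention_new = softmax(Q K_text^T / sqrt(d)) V_text
                       + lam * softmax(Q K_img^T / sqrt(d)) V_img"""
    d = q.shape[-1]
    text_out = F.softmax(q @ k_text.transpose(-2, -1) / d ** 0.5, dim=-1) @ v_text
    img_out = F.softmax(q @ k_img.transpose(-2, -1) / d ** 0.5, dim=-1) @ v_img
    return text_out + lam * img_out


q = torch.randn(2, 16, 64)       # queries from UNet features (shared)
k_text = torch.randn(2, 77, 64)  # text tokens
v_text = torch.randn(2, 77, 64)
k_img = torch.randn(2, 4, 64)    # image-prompt tokens
v_img = torch.randn(2, 4, 64)

out = decoupled_cross_attention(q, k_text, v_text, k_img, v_img, lam=0.6)
print(out.shape)  # torch.Size([2, 16, 64])
```

Setting `lam=0` recovers plain text-only cross-attention, which is exactly why a trained IP-Adapter can be dialed in and out at inference time.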
### 7.2 IP-Adapter Variants

| Variant | Image encoder | Highlights | Use case |
|---|---|---|---|
| IP-Adapter | CLIP ViT-H | Base version, global semantics | Style transfer, content reference |
| IP-Adapter Plus | CLIP ViT-H + patch tokens | Finer-grained image detail | When more detail must be preserved |
| IP-Adapter FaceID | InsightFace (ArcFace) | Dedicated face features | Identity-consistent face generation |
| IP-Adapter FaceID Plus | InsightFace + CLIP | Face + global features | Face + style control |
| InstantStyle | CLIP + style/content separation | Injects only into style-related cross-attention layers | Pure style transfer (without transferring content) |

InstantStyle's key insight: different cross-attention layers in SDXL carry different semantics:

- Up blocks: mainly control style (color, texture, mood)
- Down blocks: mainly control content and spatial layout

Injecting IP-Adapter features only into the up blocks therefore yields pure style transfer.
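In code, InstantStyle's layer selectivity amounts to setting the image-prompt weight λ per cross-attention layer: full strength on the up blocks (style), zero elsewhere. A toy illustration with hypothetical layer names (real pipelines expose per-layer scales through their own APIs):

```python
# Hypothetical cross-attention layer names in an SDXL-like UNet
layer_names = [
    "down_blocks.1.attentions.0.attn2",
    "mid_block.attentions.0.attn2",
    "up_blocks.0.attentions.1.attn2",
    "up_blocks.1.attentions.0.attn2",
]

# Style-only injection: lambda = 1 on up blocks, 0 everywhere else
ip_scales = {
    name: (1.0 if name.startswith("up_blocks") else 0.0)
    for name in layer_names
}
print(ip_scales["up_blocks.0.attentions.1.attn2"])  # 1.0
print(ip_scales["mid_block.attentions.0.attn2"])    # 0.0
```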
### 7.3 IP-Adapter Code Example

```python
import numpy as np
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

# === 1. Basic IP-Adapter usage ===
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Load the IP-Adapter weights
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)

# IP-Adapter strength: 0-1, higher means the image prompt has more influence
pipe.set_ip_adapter_scale(0.6)

# Reference image
ref_image = load_image("https://example.com/style_reference.jpg")

result = pipe(
    prompt="a cat sitting on a windowsill, warm sunlight",
    ip_adapter_image=ref_image,  # image prompt
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("ip_adapter_result.png")

# === 2. IP-Adapter FaceID (identity-consistent faces) ===
from insightface.app import FaceAnalysis

# Extract the face embedding
app = FaceAnalysis(name="buffalo_l", providers=["CUDAExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

face_image = load_image("https://example.com/face.jpg")
faces = app.get(np.array(face_image))
face_embedding = torch.tensor(faces[0].normed_embedding).unsqueeze(0)

pipe.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder="",
    weight_name="ip-adapter-faceid-plusv2_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.7)

result_face = pipe(
    prompt="a professional portrait photo, studio lighting",
    ip_adapter_image_embeds=[face_embedding],
    num_inference_steps=30,
).images[0]

# === 3. IP-Adapter + ControlNet combined ===
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe_combo = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe_combo.to("cuda")

# Load the IP-Adapter
pipe_combo.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)
pipe_combo.set_ip_adapter_scale(0.5)

# IP-Adapter provides the style, ControlNet the structure
# (canny_image and style_reference_image prepared as in Section 6.3)
result_combo = pipe_combo(
    prompt="interior design, modern minimalist",
    image=canny_image,  # ControlNet condition
    ip_adapter_image=style_reference_image,  # IP-Adapter style
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]
```
## 8. Summary

### 8.1 Core Techniques Recap

| Technique | Principle | Use case |
|---|---|---|
| Class conditioning | Embed the class label into the model | Class-controlled generation |
| Classifier-free guidance | Interpolate between conditional and unconditional predictions | Higher-quality conditional generation |
| Text-to-image | Encode text with a text encoder | Text-controlled generation |
| Inpainting | Diffuse only within the masked region | Image repair and completion |
| Image editing | Text-guided editing of an image | Image style transfer |
| ControlNet | Zero convolutions + a frozen-UNet encoder copy | Fine-grained spatial control |
| IP-Adapter | Image injection via decoupled cross-attention | Image prompts (style/content transfer) |

### 8.2 Best Practices

- Start simple: practice with class conditioning first
- Tune the guidance scale: find a value that balances fidelity and diversity
- Optimize text prompts: use clear, specific descriptions
- Design masks carefully: choose sensible regions to repaint
- Iterate: adjust parameters step by step

### 8.3 Study Tips

- Understand the theory: start with the math behind conditional generation
- Implement it yourself: build each conditioning method by hand
- Run comparisons: contrast the results of different methods
- Apply it: use these techniques in real projects

## 9. Recommended Resources

### Papers

- Classifier-Free Diffusion Guidance: "Classifier-Free Diffusion Guidance"
- GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models"
- Stable Diffusion: "High-Resolution Image Synthesis with Latent Diffusion Models"
- ControlNet: "Adding Conditional Control to Text-to-Image Diffusion Models" (Zhang & Agrawala, 2023)
- IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter" (Ye et al., 2023)

### Code

- Hugging Face Diffusers
- CompVis/stable-diffusion
- OpenAI/glide-text2im
- lllyasviel/ControlNet
- tencent-ailab/IP-Adapter

## 10. Self-Check Questions

- What distinguishes a conditional diffusion model from an unconditional one?
- How does classifier-free guidance work?
- How is text-to-image generation implemented?
- What is the difference between inpainting and image editing?
- How do you tune the guidance scale?
- What role does the zero convolution play in ControlNet, and why must the original UNet be frozen?
- How does IP-Adapter implement decoupled cross-attention, and how does it differ from ControlNet?
- How does InstantStyle achieve pure style transfer without changing content?

Next chapter: 05 - Hands-On Projects, where you apply everything from this chapter to real projects