
05 - 条件生成与控制

⚠️ 时效性说明:本章涉及前沿模型/价格/榜单等信息,可能随版本快速变化;请以论文原文、官方发布页和 API 文档为准。

学习时间: 3.5小时 重要性: ⭐⭐⭐⭐⭐ 实现可控、实用的扩散模型的关键


🎯 学习目标

完成本章后,你将能够:

  • 理解条件扩散模型的基本原理
  • 掌握分类器引导和无分类器引导
  • 学习文本到图像生成的实现
  • 实现图像编辑和修复功能
  • 掌握多种控制方法


1. 条件生成概述

1.1 什么是条件生成?

条件生成是指根据给定的条件(如类别标签、文本描述、参考图像等)生成特定内容。

类比理解:

  • 无条件生成:"画一幅画" → 随机生成
  • 条件生成:"画一只猫" → 生成猫

1.2 条件类型

| 条件类型 | 示例 | 应用场景 |
| --- | --- | --- |
| 类别标签 | CIFAR-10 的 10 个类别 | 分类控制生成 |
| 文本描述 | "一只可爱的猫" | 文本到图像 |
| 参考图像 | 输入图像 | 图像修复、编辑 |
| 空间条件 | 边缘图、深度图 | 结构控制 |
| 风格条件 | 艺术风格 | 风格迁移 |

2. 类别条件生成

2.1 基本原理

在扩散模型中添加类别条件,最简单的方法是将类别标签嵌入到模型中。

方法:

  1. 将类别标签转换为嵌入向量
  2. 将嵌入向量添加到时间步嵌入中
  3. 或者在每个层(如残差块)中注入条件信息

2.2 条件UNet实现

Python
import torch
import torch.nn as nn
import math

class SinusoidalPositionEmbedding(nn.Module):  # 继承nn.Module定义网络层
    """正弦位置编码"""

    def __init__(self, dim):
        super().__init__()  # super()调用父类方法
        self.dim = dim

    def forward(self, x):
        """
        参数:
            x: [batch_size]

        返回:
            embeddings: [batch_size, dim]
        """
        device = x.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = x[:, None] * emb[None, :]
        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)  # torch.cat沿已有维度拼接张量
        return emb

class ClassConditionedUNet(nn.Module):
    """类别条件UNet"""

    def __init__(self, in_channels=3, out_channels=3, model_dim=128, num_classes=10):
        super().__init__()
        self.model_dim = model_dim  # 注入残差块的嵌入向量维度

        # 时间步嵌入
        self.time_embedding = SinusoidalPositionEmbedding(model_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim)
        )

        # 类别嵌入(额外预留最后一个索引作为CFG的空标签 null token)
        self.class_embedding = nn.Embedding(num_classes + 1, model_dim)
        self.class_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim)
        )

        # 初始卷积
        self.conv_in = nn.Conv2d(in_channels, model_dim, 3, padding=1)

        # 下采样
        self.down_blocks = nn.ModuleList([
            self._make_down_block(model_dim, model_dim * 2),
            self._make_down_block(model_dim * 2, model_dim * 4),
            self._make_down_block(model_dim * 4, model_dim * 4),
        ])

        # 中间层
        self.mid_block = self._make_mid_block(model_dim * 4)

        # 上采样
        self.up_blocks = nn.ModuleList([
            self._make_up_block(model_dim * 4, model_dim * 4),
            self._make_up_block(model_dim * 4, model_dim * 2),
            self._make_up_block(model_dim * 2, model_dim),
        ])

        # 输出卷积
        self.conv_out = nn.Conv2d(model_dim, out_channels, 3, padding=1)

    def _make_down_block(self, in_channels, out_channels):
        """创建下采样块"""
        return nn.ModuleList([
            ResidualBlock(in_channels, out_channels, emb_dim=self.model_dim),
            ResidualBlock(out_channels, out_channels, emb_dim=self.model_dim),
            nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        ])

    def _make_up_block(self, in_channels, out_channels):
        """创建上采样块"""
        return nn.ModuleList([
            nn.ConvTranspose2d(in_channels, in_channels, 4, stride=2, padding=1),
            ResidualBlock(in_channels, out_channels, emb_dim=self.model_dim),
            ResidualBlock(out_channels, out_channels, emb_dim=self.model_dim)
        ])

    def _make_mid_block(self, channels):
        """创建中间块"""
        return nn.ModuleList([
            ResidualBlock(channels, channels, emb_dim=self.model_dim),
            ResidualBlock(channels, channels, emb_dim=self.model_dim)
        ])

    def forward(self, x, t, class_labels=None):
        """
        前向传播

        参数:
            x: [batch_size, in_channels, height, width]
            t: [batch_size] 时间步
            class_labels: [batch_size] 类别标签

        返回:
            输出: [batch_size, out_channels, height, width]
        """
        # 时间步嵌入
        t_emb = self.time_embedding(t)
        t_emb = self.time_mlp(t_emb)

        # 类别嵌入
        if class_labels is not None:
            c_emb = self.class_embedding(class_labels)
            c_emb = self.class_mlp(c_emb)
            # 合并时间步和类别嵌入
            emb = t_emb + c_emb
        else:
            emb = t_emb

        # 初始卷积
        h = self.conv_in(x)

        # 下采样
        skips = []
        for down_block in self.down_blocks:
            res1, res2, downsample = down_block
            h = res1(h, emb)
            h = res2(h, emb)
            skips.append(h)  # 在下采样前保存,使通道数与分辨率和上采样后的特征对齐
            h = downsample(h)

        # 中间层
        for layer in self.mid_block:
            h = layer(h, emb)

        # 上采样
        for i, up_block in enumerate(self.up_blocks):  # enumerate同时获取索引和元素
            for layer in up_block:
                if isinstance(layer, nn.ConvTranspose2d):
                    h = layer(h)
                    h = h + skips[-(i+1)]
                else:
                    h = layer(h, emb)

        # 输出
        h = self.conv_out(h)
        return h

class ResidualBlock(nn.Module):
    """残差块"""

    def __init__(self, in_channels, out_channels, emb_dim=None):
        super().__init__()

        self.norm1 = nn.GroupNorm(8, in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

        # 嵌入投影层:将emb_dim映射到out_channels
        if emb_dim is not None and emb_dim != out_channels:
            self.emb_proj = nn.Linear(emb_dim, out_channels)
        else:
            self.emb_proj = nn.Identity()

        # 如果通道数不同,使用1x1卷积调整
        if in_channels != out_channels:
            self.skip_conv = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.skip_conv = nn.Identity()

        self.activation = nn.SiLU()

    def forward(self, x, emb):
        """
        参数:
            x: [batch_size, in_channels, h, w]
            emb: [batch_size, emb_dim]

        返回:
            输出: [batch_size, out_channels, h, w]
        """
        h = self.norm1(x)
        h = self.activation(h)
        h = self.conv1(h)

        # 添加嵌入(投影到out_channels维度)
        emb = self.emb_proj(emb)
        emb = emb[:, :, None, None]
        h = h + emb

        h = self.norm2(h)
        h = self.activation(h)
        h = self.conv2(h)

        # 残差连接
        return h + self.skip_conv(x)

# 使用示例
model = ClassConditionedUNet(
    in_channels=3,
    out_channels=3,
    model_dim=128,
    num_classes=10
)

# 测试
x = torch.randn(4, 3, 32, 32)
t = torch.randint(0, 1000, (4,))
class_labels = torch.randint(0, 10, (4,))

output = model(x, t, class_labels)
print(f"输出形状: {output.shape}")

2.3 条件训练

Python
import torch.optim as optim

def train_conditional_diffusion(model, train_loader, val_loader, num_epochs,
                                num_classes=10, device='cuda'):
    """
    训练条件扩散模型

    参数:
        model: 条件扩散模型
        train_loader: 训练数据加载器
        val_loader: 验证数据加载器
        num_epochs: 训练轮数
        num_classes: 类别数
        device: 设备
    """
    model.to(device)  # 移至GPU/CPU
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)

    # 创建噪声调度
    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod = alphas_cumprod.to(device)

    for epoch in range(num_epochs):
        model.train()  # train()训练模式
        train_loss = 0

        for x_0, labels in train_loader:
            x_0 = x_0.to(device)
            labels = labels.to(device)

            # 随机采样时间步
            batch_size = x_0.shape[0]
            t = torch.randint(0, T, (batch_size,), device=device)

            # 生成噪声
            noise = torch.randn_like(x_0)

            # 计算加噪后的图像
            sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)  # 重塑张量形状
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
            x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise

            # 模型预测噪声
            predicted_noise = model(x_t, t, labels)

            # 计算损失
            loss = nn.functional.mse_loss(predicted_noise, noise)

            # 反向传播
            optimizer.zero_grad()  # 清零梯度
            loss.backward()  # 反向传播计算梯度
            optimizer.step()  # 更新参数

            train_loss += loss.item()  # 将单元素张量转为Python数值

        train_loss /= len(train_loader)

        # 验证
        model.eval()
        val_loss = 0
        with torch.no_grad():  # 禁用梯度计算,节省内存
            for x_0, labels in val_loader:
                x_0 = x_0.to(device)
                labels = labels.to(device)

                batch_size = x_0.shape[0]
                t = torch.randint(0, T, (batch_size,), device=device)
                noise = torch.randn_like(x_0)

                sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)
                sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
                x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise

                predicted_noise = model(x_t, t, labels)
                loss = nn.functional.mse_loss(predicted_noise, noise)
                val_loss += loss.item()

        val_loss /= len(val_loader)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

3. 无分类器引导

3.1 原理

无分类器引导(Classifier-Free Guidance, CFG)的核心思想是:用同一个网络同时学习条件预测与无条件预测(训练时随机丢弃条件),采样时对两者的预测做外插来控制条件强度。

公式:

\[
\epsilon_\theta^{\mathrm{CFG}}(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot \big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset)\big)
\]

其中:

  • \(\epsilon_\theta(x_t, t, c)\):条件模型预测
  • \(\epsilon_\theta(x_t, t, \emptyset)\):无条件模型预测(\(\emptyset\) 为空条件)
  • \(w\):引导强度,\(w=1\) 时退化为纯条件预测,\(w>1\) 时放大条件信号

3.2 训练CFG模型

Python
import torch.optim as optim

def train_cfg_model(model, train_loader, val_loader, num_epochs,
                    num_classes=10, dropout_prob=0.1, device='cuda'):
    """
    训练CFG模型

    参数:
        model: 条件扩散模型
        train_loader: 训练数据加载器
        val_loader: 验证数据加载器
        num_epochs: 训练轮数
        num_classes: 类别数
        dropout_prob: 条件dropout概率
        device: 设备
    """
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)

    T = 1000
    betas = torch.linspace(0.0001, 0.02, T)
    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod = alphas_cumprod.to(device)

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0

        for x_0, labels in train_loader:
            x_0 = x_0.to(device)
            labels = labels.to(device)

            batch_size = x_0.shape[0]

            # 以 dropout_prob 的概率丢弃条件(CFG的核心训练技巧)
            use_condition = torch.rand(batch_size, device=device) > dropout_prob
            condition_labels = labels.clone()
            condition_labels[~use_condition] = num_classes  # 空标签:Embedding中额外预留的null token索引(不能用0,0是合法类别)

            # 随机采样时间步
            t = torch.randint(0, T, (batch_size,), device=device)

            # 生成噪声
            noise = torch.randn_like(x_0)

            # 计算加噪后的图像
            sqrt_alpha_t_bar = torch.sqrt(alphas_cumprod[t]).view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alphas_cumprod[t]).view(-1, 1, 1, 1)
            x_t = sqrt_alpha_t_bar * x_0 + sqrt_one_minus_alpha_t_bar * noise

            # 模型预测噪声
            predicted_noise = model(x_t, t, condition_labels)

            # 计算损失
            loss = nn.functional.mse_loss(predicted_noise, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}")

3.3 CFG采样

Python
def cfg_sample(model, x_T, class_labels, alphas, betas, alphas_cumprod,
               guidance_scale=7.5, num_classes=10, device='cuda'):
    """
    CFG采样

    参数:
        model: 训练好的CFG模型
        x_T: 初始噪声
        class_labels: 类别标签
        alphas, betas, alphas_cumprod: 调度表
        guidance_scale: 引导强度
        num_classes: 类别数(索引 num_classes 为空标签 null token)
        device: 设备

    返回:
        生成的图像
    """
    model.eval()
    x_t = x_T.to(device)

    T = len(alphas)
    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    with torch.no_grad():
        for t in reversed(range(T)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]

            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # 预测条件噪声
            noise_cond = model(x_t, t_tensor, class_labels)

            # 预测无条件噪声(使用空标签 null token)
            noise_uncond = model(x_t, t_tensor, torch.full_like(class_labels, num_classes))

            # 组合预测
            predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

            # 更新
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)

            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
            else:
                x_t = mean

    return x_t

# 使用示例
# 生成特定类别的图像
class_labels = torch.tensor([0, 1, 2, 3], device='cuda')  # 生成4个不同类别的图像
x_T = torch.randn(4, 3, 32, 32, device='cuda')

samples = cfg_sample(model, x_T, class_labels, alphas, betas, alphas_cumprod,
                     guidance_scale=7.5, device='cuda')
print(f"生成完成,形状: {samples.shape}")

4. 文本到图像生成

4.1 文本编码

Python
from transformers import CLIPTextModel, CLIPTokenizer

class TextEncoder(nn.Module):
    """文本编码器"""

    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)
        self.text_model = CLIPTextModel.from_pretrained(model_name)

    def forward(self, text_prompts):
        """
        编码文本提示

        参数:
            text_prompts: 文本提示列表

        返回:
            text_embeddings: [batch_size, seq_len, embedding_dim]
        """
        # Tokenize
        inputs = self.tokenizer(
            text_prompts,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

        # 将输入移到文本模型所在设备后编码
        inputs = {k: v.to(self.text_model.device) for k, v in inputs.items()}
        outputs = self.text_model(**inputs)
        text_embeddings = outputs.last_hidden_state

        return text_embeddings

# 使用示例
text_encoder = TextEncoder()
# 注意:CLIP ViT-B/32 的文本特征维度为512,且主要在英文数据上训练;
# 若配合下文 text_embedding_dim=768 的UNet,应改用 openai/clip-vit-large-patch14
prompts = ["一只可爱的猫", "一只奔跑的狗", "一朵美丽的花"]
embeddings = text_encoder(prompts)
print(f"文本嵌入形状: {embeddings.shape}")

4.2 文本条件UNet

Python
class TextConditionedUNet(nn.Module):
    """文本条件UNet"""

    def __init__(self, in_channels=3, out_channels=3, model_dim=128,
                 text_embedding_dim=768):  # 768对应CLIP ViT-L/14;ViT-B/32为512
        super().__init__()

        # 时间步嵌入
        self.time_embedding = SinusoidalPositionEmbedding(model_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(model_dim, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim)
        )

        # 文本嵌入投影
        self.text_proj = nn.Linear(text_embedding_dim, model_dim * 4)
        self.text_mlp = nn.Sequential(
            nn.Linear(model_dim * 4, model_dim * 4),
            nn.SiLU(),
            nn.Linear(model_dim * 4, model_dim)
        )

        # 交叉注意力(本简化版forward未使用;完整实现应在各分辨率层
        # 让空间特征关注文本token,见本节末尾的示意模块)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=model_dim,
            num_heads=4,
            batch_first=True
        )

        # UNet结构(简化版)
        self.conv_in = nn.Conv2d(in_channels, model_dim, 3, padding=1)

        # 下采样、中间层、上采样(省略详细实现)
        self.down_blocks = nn.ModuleList([
            self._make_down_block(model_dim, model_dim * 2),
            self._make_down_block(model_dim * 2, model_dim * 4),
        ])

        self.up_blocks = nn.ModuleList([
            self._make_up_block(model_dim * 4, model_dim * 2),
            self._make_up_block(model_dim * 2, model_dim),
        ])

        self.conv_out = nn.Conv2d(model_dim, out_channels, 3, padding=1)

    def _make_down_block(self, in_channels, out_channels):
        """创建下采样块"""
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU()
        )

    def _make_up_block(self, in_channels, out_channels):
        """创建上采样块"""
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, in_channels, 4, stride=2, padding=1),
            nn.GroupNorm(8, in_channels),
            nn.SiLU(),
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.GroupNorm(8, out_channels),
            nn.SiLU()
        )

    def forward(self, x, t, text_embeddings):
        """
        前向传播

        参数:
            x: [batch_size, in_channels, h, w]
            t: [batch_size] 时间步
            text_embeddings: [batch_size, seq_len, text_dim]

        返回:
            输出: [batch_size, out_channels, h, w]
        """
        # 时间步嵌入
        t_emb = self.time_embedding(t)
        t_emb = self.time_mlp(t_emb)

        # 文本嵌入
        text_emb = self.text_proj(text_embeddings)
        text_emb = self.text_mlp(text_emb)

        # 合并嵌入(对token维度平均池化,得到全局文本向量)
        emb = t_emb + text_emb.mean(dim=1)

        # 初始卷积
        h = self.conv_in(x)
        # 简化版:将嵌入广播加到初始特征上(完整实现应注入每个残差块)
        h = h + emb[:, :, None, None]

        # 下采样
        skips = []
        for down_block in self.down_blocks:
            skips.append(h)  # 保存下采样前的特征,与上采样后的分辨率、通道数对齐
            h = down_block(h)

        # 上采样
        for i, up_block in enumerate(self.up_blocks):
            h = up_block(h)
            h = h + skips[-(i+1)]

        # 输出
        h = self.conv_out(h)
        return h
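
上面的简化实现只使用均值池化后的全局文本向量,丢失了 token 级信息;完整的文本条件 UNet 通常在各分辨率层通过交叉注意力让空间特征关注文本 token。下面是一个独立的示意模块(假设性实现,并非任何库的官方 API):

Python
class SpatialCrossAttention(nn.Module):
    """交叉注意力示意:空间特征作为Q,文本token作为K、V"""

    def __init__(self, channels, text_dim, num_heads=4):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.text_proj = nn.Linear(text_dim, channels)  # 文本嵌入投影到特征通道维度
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, h, text_embeddings):
        # h: [B, C, H, W], text_embeddings: [B, seq_len, text_dim]
        b, c, height, width = h.shape
        q = self.norm(h).flatten(2).transpose(1, 2)  # [B, H*W, C]
        kv = self.text_proj(text_embeddings)         # [B, seq_len, C]
        attn_out, _ = self.attn(q, kv, kv)           # 每个空间位置按内容关注文本token
        attn_out = attn_out.transpose(1, 2).reshape(b, c, height, width)
        return h + attn_out  # 残差连接

# 用法示意:在每个下采样/上采样块之后调用
# h = cross_attn_layer(h, text_embeddings)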

5. 图像编辑与修复

5.1 图像修复(Inpainting)

Python
def inpaint(model, original_image, mask, alphas, betas, alphas_cumprod,
            num_steps=1000, device='cuda'):
    """
    图像修复

    参数:
        model: 扩散模型
        original_image: 原始图像 [1, C, H, W]
        mask: 掩码 [1, 1, H, W], 1表示需要修复的区域
        alphas, betas, alphas_cumprod: 调度表
        num_steps: 采样步数
        device: 设备

    返回:
        修复后的图像
    """
    model.eval()
    original_image = original_image.to(device)
    mask = mask.to(device)

    # 从纯噪声开始;已知区域将在每一步用"加噪后的原图"替换
    x_t = torch.randn_like(original_image)

    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    with torch.no_grad():
        for t in reversed(range(num_steps)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]

            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # 预测噪声
            predicted_noise = model(x_t, t_tensor)

            # 计算均值
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)
            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            # 添加噪声
            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
            else:
                x_t = mean

            # RePaint式合并:已知区域需加噪到与x_t相同的噪声水平再合并,
            # 直接使用干净原图会破坏模型对输入噪声水平的假设
            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                x_known = (torch.sqrt(alpha_t_bar_prev) * original_image +
                           torch.sqrt(1 - alpha_t_bar_prev) * torch.randn_like(original_image))
            else:
                x_known = original_image
            x_t = x_t * mask + x_known * (1 - mask)

    return x_t

# 使用示例
# 创建掩码(中心区域需要修复)
mask = torch.zeros(1, 1, 32, 32)
mask[:, :, 10:22, 10:22] = 1

# 修复图像
repaired_image = inpaint(model, original_image, mask, alphas, betas, alphas_cumprod)

5.2 图像编辑

Python
def edit_image(model, original_image, target_text, text_encoder, alphas, betas,
               alphas_cumprod, guidance_scale=7.5, num_steps=1000,
               strength=0.6, device='cuda'):
    """
    图像编辑(基于文本,SDEdit风格:先加噪到中间时间步,再按文本引导去噪)

    参数:
        model: 文本条件扩散模型
        original_image: 原始图像
        target_text: 目标文本描述
        text_encoder: 文本编码器
        alphas, betas, alphas_cumprod: 调度表
        guidance_scale: 引导强度
        num_steps: 总时间步数
        strength: 加噪强度(0~1),越大编辑幅度越大、原图结构保留越少
        device: 设备

    返回:
        编辑后的图像
    """
    model.eval()
    original_image = original_image.to(device)

    # 编码文本
    text_embeddings = text_encoder([target_text]).to(device)

    # 将原始图像加噪到中间时间步t_start(而不是从纯噪声或干净图像开始)
    t_start = max(int(strength * num_steps), 1)
    noise = torch.randn_like(original_image)
    x_t = (torch.sqrt(alphas_cumprod[t_start - 1]) * original_image +
           torch.sqrt(1 - alphas_cumprod[t_start - 1]) * noise)

    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    with torch.no_grad():
        for t in reversed(range(t_start)):
            alpha_t = alphas[t]
            beta_t = betas[t]
            alpha_t_bar = alphas_cumprod[t]

            t_tensor = torch.full((x_t.shape[0],), t, device=device, dtype=torch.long)

            # 预测条件噪声
            noise_cond = model(x_t, t_tensor, text_embeddings)

            # 预测无条件噪声(此处用全零嵌入近似;更标准的做法是用空字符串""的文本嵌入)
            noise_uncond = model(x_t, t_tensor, torch.zeros_like(text_embeddings))

            # 组合预测
            predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

            # 更新
            sqrt_recip_alpha_t = 1 / torch.sqrt(alpha_t)
            sqrt_one_minus_alpha_t_bar = torch.sqrt(1 - alpha_t_bar)
            mean = sqrt_recip_alpha_t * (
                x_t - (beta_t / sqrt_one_minus_alpha_t_bar) * predicted_noise
            )

            if t > 0:
                alpha_t_bar_prev = alphas_cumprod_prev[t]
                posterior_variance = beta_t * (1 - alpha_t_bar_prev) / (1 - alpha_t_bar)
                noise = torch.randn_like(x_t)
                x_t = mean + torch.sqrt(posterior_variance) * noise
            else:
                x_t = mean

    return x_t
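
下面是一个假设性的调用示意,展示 strength 对编辑幅度的影响:

Python
# strength≈0.3:轻微编辑,保留大部分原图结构
# strength≈0.8:大幅重绘,结果主要由文本主导
edited = edit_image(model, original_image, "an oil painting of a cat",
                    text_encoder, alphas, betas, alphas_cumprod,
                    guidance_scale=7.5, num_steps=1000,
                    strength=0.6, device='cuda')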

6. ControlNet 系统讲解

6.1 ControlNet 架构原理

ControlNet(Zhang & Agrawala, 2023)是一种向预训练扩散模型添加空间条件控制的方法,其核心创新在于零卷积(Zero Convolution)与"冻结原模型、只训练副本"的策略。

核心思想:创建一个与原始 UNet 编码器结构完全相同的可训练副本,通过零卷积层与原始网络连接。

Text Only
ControlNet 架构:

                    条件输入(边缘图/深度图/骨架)
                    ┌─────────▼─────────┐
                    │   条件编码器       │
                    │ (CNN/Hint Block)   │
                    └─────────┬─────────┘
┌─────────────────────────────┼─────────────────────────────┐
│  冻结的原始 UNet            │       可训练的 ControlNet    │
│  ┌──────────────┐   Zero   │    ┌──────────────┐          │
│  │  Encoder     │◄──Conv───┼────│  Encoder     │          │
│  │  Block 1     │          │    │  Block 1     │          │
│  └──────┬───────┘          │    └──────┬───────┘          │
│         │                  │           │                   │
│  ┌──────▼───────┐   Zero   │    ┌──────▼───────┐          │
│  │  Encoder     │◄──Conv───┼────│  Encoder     │          │
│  │  Block 2     │          │    │  Block 2     │          │
│  └──────┬───────┘          │    └──────┬───────┘          │
│         │                  │           │                   │
│  ┌──────▼───────┐          │    ┌──────▼───────┐          │
│  │  Middle      │◄──Zero───┼────│  Middle      │          │
│  │  Block       │   Conv   │    │  Block       │          │
│  └──────┬───────┘          │    └──────────────┘          │
│         │                  │                               │
│  ┌──────▼───────┐          │                               │
│  │  Decoder     │          │  训练时:只训练 ControlNet    │
│  │  (保持不变)   │          │  推理时:两路输出相加         │
│  └──────────────┘          │                               │
└─────────────────────────────┴─────────────────────────────┘

关键设计

| 设计要素 | 说明 |
| --- | --- |
| 零卷积 | 权重和偏置初始化为 0 的 1×1 卷积;训练初期 ControlNet 输出为 0,不干扰原始模型 |
| 冻结策略 | 原始 UNet 参数完全冻结,只训练 ControlNet 副本和零卷积层 |
| 结构复制 | ControlNet 复制 UNet 编码器部分,保留预训练的语义理解能力 |
| 突然收敛 | 由于零卷积的存在,模型通常在少量训练步数内"突然"学会条件控制 |
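
零卷积本身的实现非常简单:权重与偏置均初始化为0的1×1卷积。下面是一个最小示意(非官方实现;frozen_block、trainable_copy 为假设的同结构模块),演示零卷积如何把可训练副本接回冻结主干:

Python
import torch
import torch.nn as nn

def zero_conv(channels):
    """零卷积:权重和偏置初始化为0的1x1卷积,训练初期输出恒为0"""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlockSketch(nn.Module):
    """示意:冻结的UNet块 + 可训练副本 + 零卷积连接"""

    def __init__(self, frozen_block, trainable_copy, channels):
        super().__init__()
        self.frozen_block = frozen_block
        for p in self.frozen_block.parameters():
            p.requires_grad = False           # 冻结原始UNet参数
        self.trainable_copy = trainable_copy  # 与frozen_block结构相同的可训练副本
        self.zero = zero_conv(channels)

    def forward(self, x, control_feat):
        h = self.frozen_block(x)
        # 训练开始时zero(...)输出为0,完全不干扰原始模型;
        # 随训练推进,零卷积逐渐学会注入控制信号
        return h + self.zero(self.trainable_copy(control_feat))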

6.2 ControlNet++:多条件控制与 Union 模型

ControlNet++ 与 ControlNet Union 在原版基础上实现了多条件控制:

  • Multi-ControlNet:并行运行多个 ControlNet,输出加权求和:
    输出 = w₁ × ControlNet(边缘图) + w₂ × ControlNet(深度图) + w₃ × ControlNet(姿态)
  • ControlNet Union:单个模型支持多种条件类型,通过条件类型 ID 切换
    • 优势:只需加载一个模型即可处理多种条件
    • 支持条件类型:Canny、Depth、Pose、Scribble、Segmentation 等
  • ControlNet++(Reward-based):引入像素级奖励引导,显著提升条件一致性

6.3 ControlNet 代码示例

Python
from diffusers import (
    StableDiffusionXLControlNetPipeline,
    ControlNetModel,
)
from diffusers.utils import load_image
import torch
import cv2
import numpy as np
from PIL import Image

# === 1. 加载 ControlNet 模型 ===
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)

# 加载 SDXL Base + ControlNet
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # 按需在CPU/GPU间搬运模块以节省显存(无需先调用pipe.to("cuda"))

# === 2. 准备控制条件(Canny 边缘图) ===
original_image = load_image("https://example.com/input.jpg")
image_np = np.array(original_image)  # PIL图像转为NumPy数组
canny_image = cv2.Canny(image_np, 100, 200)  # 提取Canny边缘
canny_image = Image.fromarray(canny_image).convert("RGB")

# === 3. 生成图像 ===
result = pipe(
    prompt="a beautiful oil painting of a cityscape, masterpiece, detailed",
    negative_prompt="blurry, low quality, deformed",
    image=canny_image,                 # 控制条件
    controlnet_conditioning_scale=0.7, # 控制强度(0~1)
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("controlnet_result.png")

# === 4. Multi-ControlNet(多条件控制) ===
controlnet_canny = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
controlnet_depth = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)

pipe_multi = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet_canny, controlnet_depth],  # 多 ControlNet
    torch_dtype=torch.float16,
)
pipe_multi.to("cuda")

# 假设 depth_image 已由深度估计模型(如 MiDaS)从原图生成
result_multi = pipe_multi(
    prompt="futuristic architecture, photorealistic",
    image=[canny_image, depth_image],          # 对应的条件图
    controlnet_conditioning_scale=[0.7, 0.5],  # 各自的控制强度
    num_inference_steps=30,
).images[0]
    

7. IP-Adapter 深入讲解

7.1 IP-Adapter 架构(解耦 Cross-Attention 机制)

IP-Adapter(Image Prompt Adapter, Ye et al., 2023)通过解耦的 Cross-Attention 机制将图像特征注入到扩散模型中,实现以图像作为提示(Image Prompt)的生成控制。

Text Only
IP-Adapter 架构:

┌────────────────┐     ┌────────────────┐
│  文本提示       │     │  参考图像       │
│  "a cat in..."  │     │  (风格/内容)    │
└───────┬────────┘     └───────┬────────┘
        │                      │
┌───────▼────────┐     ┌───────▼────────┐
│  CLIP Text     │     │  CLIP Image    │
│  Encoder       │     │  Encoder       │
└───────┬────────┘     └───────┬────────┘
        │                      │
        │  文本特征             │  图像特征
        │                      │
┌───────▼────────┐     ┌───────▼────────┐
│  原始          │     │  新增          │
│  Cross-Attn    │     │  Cross-Attn    │
│  (冻结)        │     │  (可训练)      │
│  Q·K_text·V_text│     │  Q·K_img·V_img │
└───────┬────────┘     └───────┬────────┘
        │                      │
        └──────── + ───────────┘
          ┌───────▼────────┐
          │  UNet Feature  │
          │  (融合输出)     │
          └────────────────┘

核心公式:
  Attention_new = Softmax(Q·K_text/√d)·V_text + λ·Softmax(Q·K_img/√d)·V_img
  其中 λ 控制图像提示的影响强度

核心设计:

  • 解耦 Cross-Attention:文本和图像各有独立的 K、V 投影,Q 共享
  • 仅训练 IP-Adapter 层:UNet 和 CLIP 编码器均冻结,新增参数量极少(约22M)
  • 可组合:与 ControlNet、LoRA 兼容,可同时使用
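
下面用一个最小的单头示意模块演示上述公式中的解耦注意力(假设性实现,省略了多头拆分与维度投影细节,并非 IP-Adapter 官方代码):

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """示意:Q共享,文本与图像各有独立K、V的解耦Cross-Attention"""

    def __init__(self, dim, scale=0.6):
        super().__init__()
        self.scale = scale                    # 对应公式中的λ,控制图像提示强度
        self.to_q = nn.Linear(dim, dim)       # Q共享(来自UNet特征)
        self.to_k_text = nn.Linear(dim, dim)  # 原始文本K、V(训练时冻结)
        self.to_v_text = nn.Linear(dim, dim)
        self.to_k_img = nn.Linear(dim, dim)   # 新增图像K、V(仅训练这部分)
        self.to_v_img = nn.Linear(dim, dim)

    @staticmethod
    def attention(q, k, v):
        d = q.shape[-1]
        w = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return w @ v

    def forward(self, hidden_states, text_emb, image_emb):
        # hidden_states: [B, N, dim]; text_emb/image_emb: [B, seq, dim]
        q = self.to_q(hidden_states)
        out_text = self.attention(q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        out_img = self.attention(q, self.to_k_img(image_emb), self.to_v_img(image_emb))
        # Attention_new = Attn(Q,K_text,V_text) + λ·Attn(Q,K_img,V_img)
        return out_text + self.scale * out_img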

7.2 IP-Adapter 变体

| 变体 | 图像编码器 | 特点 | 适用场景 |
| --- | --- | --- | --- |
| IP-Adapter | CLIP ViT-H | 基础版,全局语义 | 风格迁移、内容参考 |
| IP-Adapter Plus | CLIP ViT-H + patch tokens | 更细粒度的图像细节 | 需要保留更多细节 |
| IP-Adapter FaceID | InsightFace (ArcFace) | 专用人脸特征 | 人脸一致性生成 |
| IP-Adapter FaceID Plus | InsightFace + CLIP | 人脸 + 全局特征 | 人脸 + 风格控制 |
| InstantStyle | CLIP + 风格/内容分离 | 仅在风格层注入 Cross-Attn | 纯风格迁移(不迁移内容) |

InstantStyle 的关键创新:

  • 发现 SDXL 的不同 Cross-Attention 层对应不同的语义:
    • Up blocks:主要控制风格(色彩、纹理、氛围)
    • Down blocks:主要控制内容/空间布局
  • 仅在 Up blocks 注入 IP-Adapter 特征,实现纯风格迁移(按层控制的代码示意见 7.3)

7.3 IP-Adapter 代码示例

Python
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
import torch

# === 1. 基础 IP-Adapter 使用 ===
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# 加载 IP-Adapter 权重
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)

# 设置 IP-Adapter 强度
pipe.set_ip_adapter_scale(0.6)  # 0~1, 越高图像提示影响越大

# 参考图像
ref_image = load_image("https://example.com/style_reference.jpg")

result = pipe(
    prompt="a cat sitting on a windowsill, warm sunlight",
    ip_adapter_image=ref_image,      # 图像提示
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("ip_adapter_result.png")

# === 2. IP-Adapter FaceID(人脸一致性) ===
from insightface.app import FaceAnalysis
import numpy as np

# 提取人脸特征
app = FaceAnalysis(name="buffalo_l", providers=["CUDAExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))
face_image = load_image("https://example.com/face.jpg")
faces = app.get(np.array(face_image))
face_embedding = torch.tensor(faces[0].normed_embedding).unsqueeze(0)  # unsqueeze增加一个维度

pipe.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder="",
    weight_name="ip-adapter-faceid-plusv2_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.7)

result_face = pipe(
    prompt="a professional portrait photo, studio lighting",
    ip_adapter_image_embeds=[face_embedding],  # 注意:不同diffusers版本对嵌入形状/CFG负嵌入的要求不同,以官方文档为准
    num_inference_steps=30,
).images[0]

# === 3. IP-Adapter + ControlNet 组合 ===
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe_combo = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe_combo.to("cuda")

# 加载 IP-Adapter
pipe_combo.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)
pipe_combo.set_ip_adapter_scale(0.5)

# 同时使用 IP-Adapter(风格)+ ControlNet(结构)
# 假设 canny_image、style_reference_image 已按前文方式准备好
result_combo = pipe_combo(
    prompt="interior design, modern minimalist",
    image=canny_image,                          # ControlNet 条件
    ip_adapter_image=style_reference_image,     # IP-Adapter 风格参考
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]

8. 总结

8.1 核心技术回顾

| 技术 | 原理 | 应用场景 |
| --- | --- | --- |
| 类别条件 | 将类别标签嵌入模型 | 分类控制生成 |
| 无分类器引导 | 条件与无条件预测的插值 | 提高条件生成的质量 |
| 文本到图像 | 使用文本编码器编码文本 | 文本控制生成 |
| 图像修复 | 在掩码区域进行扩散 | 图像修复、补全 |
| 图像编辑 | 先加噪再按文本引导去噪 | 图像风格迁移 |
| ControlNet | 零卷积 + 冻结UNet的可训练编码器副本 | 空间条件精细控制 |
| IP-Adapter | 解耦Cross-Attention注入图像特征 | 图像提示(风格/内容迁移) |

8.2 最佳实践

  1. 从简单开始:先用类别条件练习
  2. 调整引导强度:找到合适的引导强度
  3. 文本提示优化:使用清晰、具体的文本描述
  4. 掩码设计:合理设计修复区域的掩码
  5. 迭代优化:逐步调整参数

8.3 学习建议

  1. 理解原理:先理解条件生成的数学原理
  2. 动手实现:亲自实现各种条件生成方法
  3. 对比实验:对比不同方法的效果
  4. 应用实践:在实际项目中应用

9. 推荐资源

论文

  • Classifier-Free Diffusion Guidance: "Classifier-Free Diffusion Guidance"
  • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models"
  • Stable Diffusion: "High-Resolution Image Synthesis with Latent Diffusion Models"
  • ControlNet: "Adding Conditional Control to Text-to-Image Diffusion Models" (Zhang & Agrawala, 2023)
  • IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter" (Ye et al., 2023)

代码库

  • Hugging Face Diffusers
  • CompVis/stable-diffusion
  • OpenAI/glide-text2im
  • lllyasviel/ControlNet
  • tencent-ailab/IP-Adapter

10. 自测问题

  1. 条件扩散模型和无条件扩散模型有什么区别?
  2. 无分类器引导的原理是什么?
  3. 如何实现文本到图像生成?
  4. 图像修复和图像编辑有什么区别?
  5. 如何调整引导强度?
  6. ControlNet 中零卷积的作用是什么?为什么需要冻结原始 UNet?
  7. IP-Adapter 如何实现解耦的 Cross-Attention?与 ControlNet 有什么区别?
  8. InstantStyle 如何实现纯风格迁移而不改变内容?

下一章: 05-实战项目 - 将所学知识应用到实际项目中