
02 - SAC Algorithm: Soft Actor-Critic

Study time: 4-5 hours · Importance: ⭐⭐⭐⭐⭐ · A state-of-the-art off-policy algorithm · Prerequisites: Actor-Critic, maximum entropy RL


🎯 Learning Objectives

After completing this chapter, you will be able to:

  • Understand the maximum entropy reinforcement learning framework
  • Master SAC's core components (twin Q-networks, automatic temperature tuning)
  • Implement the complete SAC algorithm
  • Apply SAC to continuous control problems


1. Maximum Entropy Reinforcement Learning

1.1 Standard RL vs. Maximum Entropy RL

Standard RL objective: \[J(\pi) = \sum_t \mathbb{E}[r(s_t, a_t)]\]

Maximum entropy RL objective: \[J(\pi) = \sum_t \mathbb{E}[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))]\]

Advantages:

  • Encourages exploration
  • More robust (less sensitive to model error)
  • Can learn multimodal policies

1.2 The Soft Q-Function

Soft Bellman equation: \[Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}[V(s_{t+1})]\]

Soft value function: \[V(s_t) = \mathbb{E}_{a_t \sim \pi}[Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)]\]
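
The soft value function can be checked numerically by Monte Carlo. A minimal sketch in a toy 1-D setup (the quadratic Q-function and unit Gaussian policy are assumptions for illustration, not part of SAC):

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Toy setup (illustrative): Q(s, a) = -a^2, pi(.|s) = N(0, 1), alpha = 0.2
alpha = 0.2
pi = Normal(torch.tensor(0.0), torch.tensor(1.0))

# V(s) = E_{a~pi}[Q(s, a) - alpha * log pi(a|s)], estimated by sampling
a = pi.sample((100_000,))
q = -a.pow(2)
v_soft = (q - alpha * pi.log_prob(a)).mean()

# Analytically: E[-a^2] = -1 and H(N(0,1)) = 0.5 * log(2*pi*e) ≈ 1.419,
# so V(s) ≈ -1 + 0.2 * 1.419 ≈ -0.716
print(v_soft.item())
```

The entropy bonus shows up as the gap between the plain expected Q-value (-1) and the soft value.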


2. The SAC Algorithm

2.1 Core Components

  1. Actor: Gaussian policy network
  2. Twin critics: two Q-networks (to reduce overestimation)
  3. Temperature parameter: automatically tunes the degree of exploration

2.2 Implementation

Python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal
import numpy as np

class Actor(nn.Module):  # subclass nn.Module to define the network
    """Gaussian policy network."""

    def __init__(self, state_dim, action_dim, hidden_dim=256, log_std_min=-20, log_std_max=2):
        super(Actor, self).__init__()

        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = self.net(state)
        mean = self.mean(x)
        log_std = self.log_std(x)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        return mean, log_std

    def sample(self, state):
        """采样动作(重参数化技巧)"""
        mean, log_std = self.forward(state)
        std = log_std.exp()

        # Reparameterized draw: x = mean + std * eps
        normal = Normal(mean, std)
        x_t = normal.rsample()
        action = torch.tanh(x_t)

        # Compute log_prob (correct for the tanh change of variables)
        log_prob = normal.log_prob(x_t)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6)
        log_prob = log_prob.sum(1, keepdim=True)

        return action, log_prob, mean

class Critic(nn.Module):
    """Q网络"""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(Critic, self).__init__()

        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], 1)  # torch.cat concatenates along an existing dimension
        return self.net(x)

class SAC:
    """SAC算法"""

    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 tau=0.005, alpha=0.2, automatic_entropy_tuning=True):

        self.gamma = gamma
        self.tau = tau
        self.automatic_entropy_tuning = automatic_entropy_tuning

        # Actor
        self.actor = Actor(state_dim, action_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)

        # Critics (twin Q-networks)
        self.critic1 = Critic(state_dim, action_dim)
        self.critic2 = Critic(state_dim, action_dim)
        self.critic1_optimizer = optim.Adam(self.critic1.parameters(), lr=lr)
        self.critic2_optimizer = optim.Adam(self.critic2.parameters(), lr=lr)

        # Target networks
        self.critic1_target = Critic(state_dim, action_dim)
        self.critic2_target = Critic(state_dim, action_dim)
        self.critic1_target.load_state_dict(self.critic1.state_dict())
        self.critic2_target.load_state_dict(self.critic2.state_dict())

        # Automatic temperature tuning
        if self.automatic_entropy_tuning:
            self.target_entropy = -action_dim
            self.log_alpha = torch.zeros(1, requires_grad=True)
            self.alpha_optimizer = optim.Adam([self.log_alpha], lr=lr)
            self.alpha = self.log_alpha.exp()
        else:
            self.alpha = torch.tensor(alpha)  # keep alpha a tensor so .detach() in update() works

    def select_action(self, state, evaluate=False):
        """选择动作"""
        state = torch.FloatTensor(state).unsqueeze(0)  # unsqueeze增加一个维度

        if evaluate:
            # Deterministic action: take the mean and squash through tanh
            _, _, mean = self.actor.sample(state)
            action = torch.tanh(mean)
        else:
            action, _, _ = self.actor.sample(state)

        return action.detach().cpu().numpy()[0]  # detach from the graph; no gradients needed here

    def update(self, batch):
        """更新网络"""
        states, actions, rewards, next_states, dones = batch

        states = torch.FloatTensor(states)
        actions = torch.FloatTensor(actions)
        rewards = torch.FloatTensor(rewards).unsqueeze(1)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones).unsqueeze(1)

        # Update the critics
        with torch.no_grad():  # no gradient tracking for the targets; saves memory
            next_actions, next_log_probs, _ = self.actor.sample(next_states)

            target_q1 = self.critic1_target(next_states, next_actions)
            target_q2 = self.critic2_target(next_states, next_actions)
            target_q = torch.min(target_q1, target_q2) - self.alpha * next_log_probs

            target_q = rewards + (1 - dones) * self.gamma * target_q

        current_q1 = self.critic1(states, actions)
        current_q2 = self.critic2(states, actions)

        critic1_loss = F.mse_loss(current_q1, target_q)  # F.* is PyTorch's functional API
        critic2_loss = F.mse_loss(current_q2, target_q)

        self.critic1_optimizer.zero_grad()  # clear accumulated gradients
        critic1_loss.backward()  # backpropagate to compute gradients
        self.critic1_optimizer.step()  # apply the update

        self.critic2_optimizer.zero_grad()
        critic2_loss.backward()
        self.critic2_optimizer.step()

        # Update the actor
        new_actions, log_probs, _ = self.actor.sample(states)

        q1_new = self.critic1(states, new_actions)
        q2_new = self.critic2(states, new_actions)
        q_new = torch.min(q1_new, q2_new)

        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update the temperature parameter
        if self.automatic_entropy_tuning:
            alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()

            self.alpha_optimizer.zero_grad()
            alpha_loss.backward()
            self.alpha_optimizer.step()

            self.alpha = self.log_alpha.exp()

        # Soft-update the target networks
        for param, target_param in zip(self.critic1.parameters(), self.critic1_target.parameters()):  # zip pairs parameters positionally
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

        for param, target_param in zip(self.critic2.parameters(), self.critic2_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
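
`SAC.update` expects a batch of `(states, actions, rewards, next_states, dones)` arrays. A minimal FIFO replay buffer sketch that produces such batches (the class name and capacity are illustrative; in practice the transitions come from environment interaction rather than random noise):

```python
import numpy as np

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = []
        self.capacity = capacity

    def push(self, s, a, r, s2, d):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)                 # drop the oldest transition
        self.buffer.append((s, a, r, s2, d))

    def sample(self, batch_size):
        idx = np.random.randint(len(self.buffer), size=batch_size)
        s, a, r, s2, d = zip(*(self.buffer[i] for i in idx))
        # Tuple layout matches the unpacking in SAC.update:
        # states, actions, rewards, next_states, dones
        return (np.array(s, np.float32), np.array(a, np.float32),
                np.array(r, np.float32), np.array(s2, np.float32),
                np.array(d, np.float32))

# Usage: fill with (here random) transitions, then call agent.update(buf.sample(256))
buf = ReplayBuffer()
for _ in range(512):
    buf.push(np.random.randn(3), np.random.randn(1), 0.0, np.random.randn(3), 0.0)
batch = buf.sample(64)
print(batch[0].shape, batch[2].shape)  # (64, 3) (64,)
```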

3. Key Techniques

3.1 Twin Q-Networks

  • Use two Q-networks and take the minimum of their estimates
  • Reduces Q-value overestimation
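
The effect can be seen in a toy simulation (assumed i.i.d. Gaussian estimation noise and equally good actions; purely illustrative): maximizing over noisy Q-estimates is biased upward, while taking the elementwise minimum of two independent critics pushes the target back down, as in the clipped double-Q target used above:

```python
import numpy as np

np.random.seed(0)
n_actions, n_trials = 5, 100_000
true_q = 0.0                                   # every action is equally good

# Two critics with independent estimation errors
q1 = true_q + np.random.randn(n_trials, n_actions)
q2 = true_q + np.random.randn(n_trials, n_actions)

single_max = q1.max(axis=1).mean()                   # max over noisy estimates: biased up
clipped_max = np.minimum(q1, q2).max(axis=1).mean()  # clipped double-Q style target

print(single_max, clipped_max)  # single_max ≈ 1.16; clipped_max sits much closer to 0
```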

3.2 Reparameterization Trick

  • Allows backpropagation through the stochastic sampling node
  • Lowers the variance of the policy gradient
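
A minimal sketch of the difference between PyTorch's `rsample()` (reparameterized, \(a = \mu + \sigma \epsilon\)) and `sample()`; the squared loss here is arbitrary and only for illustration:

```python
import torch
from torch.distributions import Normal

mean = torch.tensor(0.0, requires_grad=True)
std = torch.tensor(1.0)

# rsample() draws a = mean + std * eps, so gradients flow back to `mean`
a = Normal(mean, std).rsample()
a.pow(2).backward()
print(mean.grad is not None)   # True: the sample is differentiable

# sample() detaches the draw from the graph: no gradient path to `mean`
b = Normal(mean, std).sample()
print(b.requires_grad)         # False
```

This is why `Actor.sample` above uses `rsample()`: the actor loss can then be differentiated directly through the sampled action.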

3.3 Automatic Temperature Tuning

Goal: automatically adjust \(\alpha\) so that the policy entropy stays near the target value \(\bar{\mathcal{H}}\).

3.3.1 From Constrained Optimization to the Dual Problem (Full Derivation)

Step 1: The constrained policy optimization problem

SAC maximizes expected reward while requiring the policy entropy to stay at or above the target value \(\bar{\mathcal{H}}\):

\[\max_\pi \mathbb{E}\left[\sum_t r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{(s_t,a_t)\sim\pi}[-\log \pi(a_t|s_t)] \geq \bar{\mathcal{H}}, \;\forall t\]

That is, the policy entropy at every time step must satisfy \(\mathcal{H}(\pi(\cdot|s_t)) \geq \bar{\mathcal{H}}\).

Step 2: The Lagrangian

Introduce a Lagrange multiplier \(\alpha \geq 0\) (the temperature parameter) and form the Lagrangian:

\[\mathcal{L}(\pi, \alpha) = \mathbb{E}\left[\sum_t r(s_t, a_t) + \alpha\left(\mathcal{H}(\pi(\cdot|s_t)) - \bar{\mathcal{H}}\right)\right]\]
\[= \mathbb{E}\left[\sum_t r(s_t, a_t) - \alpha \log \pi(a_t|s_t) - \alpha \bar{\mathcal{H}}\right]\]

Step 3: The dual problem

By Lagrangian duality, the primal problem is equivalent to:

\[\min_{\alpha \geq 0} \max_\pi \mathcal{L}(\pi, \alpha)\]

With \(\alpha\) fixed, the inner \(\max_\pi\) is exactly the standard maximum entropy RL problem (i.e., SAC's policy update). With \(\pi\) fixed, the outer optimization over \(\alpha\) becomes:

\[\min_\alpha \; \mathbb{E}_{a_t \sim \pi_t}\left[- \alpha \log \pi_t(a_t|s_t) - \alpha \bar{\mathcal{H}}\right]\]

Step 4: The temperature loss function

The loss function for the temperature parameter \(\alpha\) is therefore (the dual objective):

\[J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}[-\alpha \log \pi_t(a_t|s_t) - \alpha \bar{\mathcal{H}}]\]
\[= \alpha \cdot \mathbb{E}_{a_t \sim \pi_t}[-\log \pi_t(a_t|s_t) - \bar{\mathcal{H}}]\]

Step 5: Gradient and update rule

Taking the gradient with respect to \(\alpha\):

\[\nabla_\alpha J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}[-\log \pi_t(a_t|s_t) - \bar{\mathcal{H}}] = \mathbb{E}_{a_t \sim \pi_t}[\mathcal{H}(\pi) - \bar{\mathcal{H}}]\]

(In practice, \(\log \alpha\) is optimized instead, which guarantees \(\alpha > 0\).)

\[\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha) = \alpha + \lambda_\alpha \mathbb{E}_{a_t \sim \pi_t}[\log \pi_t(a_t|s_t) + \bar{\mathcal{H}}]\]

where the target entropy is typically set to \(\bar{\mathcal{H}} = -\dim(\mathcal{A})\) (the negative of the action-space dimensionality).

Intuition:

  • When the policy entropy \(> \bar{\mathcal{H}}\) (too much exploration): \(\nabla_\alpha J > 0\), so gradient descent decreases \(\alpha\), weakening the entropy regularization
  • When the policy entropy \(< \bar{\mathcal{H}}\) (too little exploration): \(\nabla_\alpha J < 0\), so gradient descent increases \(\alpha\), strengthening the entropy regularization
  • At the equilibrium point, the policy entropy equals \(\bar{\mathcal{H}}\) and the gradient vanishes
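
This behavior can be sketched in isolation with the same loss form as in the `update` method above (the fixed entropy value here is a made-up constant for illustration):

```python
import torch

target_entropy = -1.0                            # e.g. -dim(A) for a 1-D action space
log_alpha = torch.zeros(1, requires_grad=True)   # alpha starts at exp(0) = 1
opt = torch.optim.Adam([log_alpha], lr=0.1)

# Pretend the measured policy entropy is -3 < target, i.e. E[log pi] = 3
log_probs = torch.tensor([3.0])

for _ in range(50):
    opt.zero_grad()
    # Same form as in SAC.update: -(log_alpha * (log_pi + target_H)).mean()
    alpha_loss = -(log_alpha * (log_probs + target_entropy)).mean()
    alpha_loss.backward()
    opt.step()

print(log_alpha.exp().item())  # alpha has grown above 1: the entropy bonus strengthened
```

With entropy below the target, the loss gradient pushes \(\log \alpha\) up, exactly matching the second bullet above.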


4. Chapter Summary

Core Concepts

Text Only
SAC:
├── Maximum entropy framework: reward + α·entropy
├── Twin Q-networks: reduce overestimation
├── Reparameterization: low-variance gradients
└── Automatic temperature: adaptive exploration

Advantages:
├── High sample efficiency (off-policy)
├── Stable training
└── Continuous action spaces

✅ Self-Check Questions

  1. How does maximum entropy RL differ from standard RL?

  2. Why use twin Q-networks?

  3. How does automatic temperature tuning work?


📚 Further Reading

  1. Haarnoja et al. (2018), "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning"

→ Next: 03-TRPO算法.md