# 02 - SAC Algorithm: Soft Actor-Critic

Estimated time: 4-5 hours | Importance: ⭐⭐⭐⭐⭐ | A state-of-the-art off-policy algorithm | Prerequisites: Actor-Critic, maximum entropy RL

## 🎯 Learning Objectives

After completing this chapter, you will be able to:

- Understand the maximum entropy reinforcement learning framework
- Master SAC's core components (double Q-networks, automatic temperature tuning)
- Implement the complete SAC algorithm
- Apply SAC to continuous control problems
## 1. Maximum Entropy Reinforcement Learning

### 1.1 Standard RL vs. Maximum Entropy RL

Standard RL objective:

\[J(\pi) = \sum_t \mathbb{E}[r(s_t, a_t)]\]

Maximum entropy RL objective:

\[J(\pi) = \sum_t \mathbb{E}[r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))]\]

Advantages:

- Encourages exploration
- More robust (less sensitive to model errors)
- Can learn multimodal policies
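The entropy bonus can be made concrete with a one-step numerical sketch. The numbers below (reward, temperature, a 2-D Gaussian policy) are made up for illustration; the point is only that the maximum entropy objective adds \(\alpha \mathcal{H}\) per step, and a wider policy earns a larger bonus:

```python
import torch
from torch.distributions import Normal

# Hypothetical one-step illustration: augment the reward with the policy's entropy.
alpha = 0.2                      # temperature: weight of the entropy bonus
reward = torch.tensor(1.0)       # r(s_t, a_t)
policy = Normal(torch.zeros(2), torch.ones(2))  # a 2-D Gaussian policy at state s_t

entropy = policy.entropy().sum()          # H(pi(.|s_t)) for a diagonal Gaussian
soft_reward = reward + alpha * entropy    # the per-step term of the max-entropy objective

# A wider (more exploratory) policy earns a larger entropy bonus
wide_policy = Normal(torch.zeros(2), 2 * torch.ones(2))
assert wide_policy.entropy().sum() > entropy
```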
### 1.2 Soft Q-Function

Soft Bellman equation:

\[Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}[V(s_{t+1})]\]

Soft value function:

\[V(s_t) = \mathbb{E}_{a_t \sim \pi}[Q(s_t, a_t) - \alpha \log \pi(a_t|s_t)]\]
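The soft value function can be estimated by Monte Carlo sampling of the expectation above. In this sketch, `q_value` is a made-up quadratic stand-in for a learned Q-network, so the numbers are illustrative only:

```python
import torch
from torch.distributions import Normal

# Monte Carlo estimate of V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)].
alpha = 0.2
policy = Normal(torch.tensor(0.0), torch.tensor(1.0))

def q_value(a):                      # hypothetical Q(s, a) for one fixed state s
    return -(a - 0.5) ** 2

actions = policy.sample((10000,))
soft_v = (q_value(actions) - alpha * policy.log_prob(actions)).mean()

# The entropy term adds roughly alpha * H(pi) on top of the plain expected Q
plain_v = q_value(actions).mean()
assert soft_v > plain_v
```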
## 2. The SAC Algorithm

### 2.1 Core Components

- Actor: a Gaussian policy network
- Double critic: two Q-networks (to reduce overestimation)
- Temperature parameter: automatically adjusts the degree of exploration

### 2.2 Implementation
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal
import numpy as np


class Actor(nn.Module):
    """Gaussian policy network."""
    def __init__(self, state_dim, action_dim, hidden_dim=256, log_std_min=-20, log_std_max=2):
        super(Actor, self).__init__()
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = self.net(state)
        mean = self.mean(x)
        log_std = self.log_std(x)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        return mean, log_std

    def sample(self, state):
        """Sample an action using the reparameterization trick."""
        mean, log_std = self.forward(state)
        std = log_std.exp()
        # Reparameterized sample: x_t = mean + std * noise
        normal = Normal(mean, std)
        x_t = normal.rsample()
        action = torch.tanh(x_t)
        # Log-probability with the tanh change-of-variables (Jacobian) correction
        log_prob = normal.log_prob(x_t)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6)
        log_prob = log_prob.sum(1, keepdim=True)
        return action, log_prob, mean


class Critic(nn.Module):
    """Q-network."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], 1)  # concatenate state and action along the feature dim
        return self.net(x)


class SAC:
    """Soft Actor-Critic."""
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 tau=0.005, alpha=0.2, automatic_entropy_tuning=True):
        self.gamma = gamma
        self.tau = tau
        self.automatic_entropy_tuning = automatic_entropy_tuning
        # Actor
        self.actor = Actor(state_dim, action_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr)
        # Critics (double Q-networks)
        self.critic1 = Critic(state_dim, action_dim)
        self.critic2 = Critic(state_dim, action_dim)
        self.critic1_optimizer = optim.Adam(self.critic1.parameters(), lr=lr)
        self.critic2_optimizer = optim.Adam(self.critic2.parameters(), lr=lr)
        # Target networks, initialized as exact copies of the critics
        self.critic1_target = Critic(state_dim, action_dim)
        self.critic2_target = Critic(state_dim, action_dim)
        self.critic1_target.load_state_dict(self.critic1.state_dict())
        self.critic2_target.load_state_dict(self.critic2.state_dict())
        # Automatic temperature tuning
        if self.automatic_entropy_tuning:
            self.target_entropy = -action_dim
            self.log_alpha = torch.zeros(1, requires_grad=True)
            self.alpha_optimizer = optim.Adam([self.log_alpha], lr=lr)
            self.alpha = self.log_alpha.exp()
        else:
            # Keep alpha a tensor so .detach() below works in both branches
            self.alpha = torch.tensor(alpha)

    def select_action(self, state, evaluate=False):
        """Select an action; deterministic when evaluate=True."""
        state = torch.FloatTensor(state).unsqueeze(0)  # add a batch dimension
        if evaluate:
            # Deterministic action: squash the mean through tanh
            _, _, mean = self.actor.sample(state)
            action = torch.tanh(mean)
        else:
            action, _, _ = self.actor.sample(state)
        return action.detach().cpu().numpy()[0]  # detach from the graph; no gradients flow here

    def update(self, batch):
        """One gradient step for critics, actor, and temperature."""
        states, actions, rewards, next_states, dones = batch
        states = torch.FloatTensor(states)
        actions = torch.FloatTensor(actions)
        rewards = torch.FloatTensor(rewards).unsqueeze(1)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones).unsqueeze(1)
        # Critic update
        with torch.no_grad():  # targets carry no gradient
            next_actions, next_log_probs, _ = self.actor.sample(next_states)
            target_q1 = self.critic1_target(next_states, next_actions)
            target_q2 = self.critic2_target(next_states, next_actions)
            target_q = torch.min(target_q1, target_q2) - self.alpha * next_log_probs
            target_q = rewards + (1 - dones) * self.gamma * target_q
        current_q1 = self.critic1(states, actions)
        current_q2 = self.critic2(states, actions)
        critic1_loss = F.mse_loss(current_q1, target_q)
        critic2_loss = F.mse_loss(current_q2, target_q)
        self.critic1_optimizer.zero_grad()
        critic1_loss.backward()
        self.critic1_optimizer.step()
        self.critic2_optimizer.zero_grad()
        critic2_loss.backward()
        self.critic2_optimizer.step()
        # Actor update
        new_actions, log_probs, _ = self.actor.sample(states)
        q1_new = self.critic1(states, new_actions)
        q2_new = self.critic2(states, new_actions)
        q_new = torch.min(q1_new, q2_new)
        actor_loss = (self.alpha.detach() * log_probs - q_new).mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        # Temperature update
        if self.automatic_entropy_tuning:
            alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()
            self.alpha_optimizer.zero_grad()
            alpha_loss.backward()
            self.alpha_optimizer.step()
            self.alpha = self.log_alpha.exp()
        # Soft (Polyak) update of the target networks
        for param, target_param in zip(self.critic1.parameters(), self.critic1_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
        for param, target_param in zip(self.critic2.parameters(), self.critic2_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
```
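The soft target update at the end of `update` can be checked in isolation. This sketch uses a throwaway pair of `nn.Linear` layers (their sizes and the +1 "training" shift are arbitrary) to show that one Polyak step moves the target only a fraction τ of the way toward the online network:

```python
import torch
import torch.nn as nn

# Isolated sketch of the soft (Polyak) target update:
# target <- tau * online + (1 - tau) * target
tau = 0.005
online = nn.Linear(4, 4)
target = nn.Linear(4, 4)
target.load_state_dict(online.state_dict())  # start as an exact copy

# Pretend the online network trained for a while: shift every weight by +1
with torch.no_grad():
    for p in online.parameters():
        p.add_(1.0)

for p, tp in zip(online.parameters(), target.parameters()):
    tp.data.copy_(tau * p.data + (1 - tau) * tp.data)

# The gap shrank by exactly tau: remaining difference is 1 - tau = 0.995
diff = (next(online.parameters()) - next(target.parameters())).abs().mean()
```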
## 3. Key Techniques

### 3.1 Double Q-Networks

- Use two Q-networks and take the minimum
- Reduces Q-value overestimation
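The overestimation effect is easy to reproduce numerically. In this sketch the true Q-values of five actions are all zero, so any positive average is pure bias; taking the minimum over two independent noisy estimates (the clipped double-Q idea) shrinks it substantially:

```python
import numpy as np

# Why take the min of two critics: with noisy Q estimates, a max-based
# backup acquires positive bias; min over two independent estimates shrinks it.
rng = np.random.default_rng(0)
true_q = np.zeros(5)  # true Q-values of 5 actions are all 0

noisy1 = true_q + rng.normal(0, 1, size=(100000, 5))
noisy2 = true_q + rng.normal(0, 1, size=(100000, 5))

single_max = noisy1.max(axis=1).mean()                      # E[max_a Q_hat] > 0: overestimation
double_min = np.minimum(noisy1, noisy2).max(axis=1).mean()  # clipped double-Q style

assert single_max > double_min
```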
### 3.2 The Reparameterization Trick

- Allows backpropagation through the stochastic sampling node
- Lowers the variance of the policy gradient
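This is why `Actor.sample` calls `rsample()` rather than `sample()`. A minimal check: `rsample()` draws `a = mean + std * eps` with `eps ~ N(0, 1)`, so gradients flow back to the policy parameters, while `sample()` detaches the draw:

```python
import torch
from torch.distributions import Normal

mean = torch.tensor(0.0, requires_grad=True)
std = torch.tensor(1.0, requires_grad=True)
dist = Normal(mean, std)

a = dist.rsample()          # reparameterized: differentiable w.r.t. mean and std
loss = (a - 2.0) ** 2
loss.backward()
assert mean.grad is not None  # gradient reached the policy parameters

assert not dist.sample().requires_grad  # plain sampling blocks the gradient
```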
### 3.3 Automatic Temperature Tuning

Goal: automatically adjust \(\alpha\) so that the policy entropy stays near a target value \(\bar{\mathcal{H}}\).
#### 3.3.1 From Constrained Optimization to the Dual Problem (Full Derivation)

**Step 1: The constrained policy optimization problem**

SAC maximizes reward while requiring the policy entropy at every time step to stay at or above the target \(\bar{\mathcal{H}}\):

\[\max_\pi \; \mathbb{E}\Big[\sum_t r(s_t, a_t)\Big] \quad \text{s.t.} \quad \mathcal{H}(\pi(\cdot|s_t)) \geq \bar{\mathcal{H}} \;\; \forall t\]

**Step 2: The Lagrangian**

Introduce a Lagrange multiplier \(\alpha \geq 0\) (the temperature parameter) and form the Lagrangian:

\[\mathcal{L}(\pi, \alpha) = \mathbb{E}\Big[\sum_t r(s_t, a_t) + \alpha \big(\mathcal{H}(\pi(\cdot|s_t)) - \bar{\mathcal{H}}\big)\Big]\]

**Step 3: The dual problem**

By Lagrangian duality, the primal problem corresponds to:

\[\min_{\alpha \geq 0} \; \max_\pi \; \mathcal{L}(\pi, \alpha)\]

With \(\alpha\) fixed, the inner \(\max_\pi\) is exactly the standard maximum entropy RL problem (i.e., SAC's policy update). With \(\pi\) fixed, the outer optimization over \(\alpha\) becomes:

\[\min_{\alpha \geq 0} \; \mathbb{E}_{a_t \sim \pi}\big[\alpha \big(-\log \pi(a_t|s_t) - \bar{\mathcal{H}}\big)\big]\]

**Step 4: The temperature loss function**

The loss for the temperature parameter \(\alpha\) (the dual objective) is therefore:

\[J(\alpha) = \mathbb{E}_{a_t \sim \pi}\big[-\alpha \log \pi(a_t|s_t) - \alpha \bar{\mathcal{H}}\big]\]

**Step 5: Gradient and update rule**

Taking the gradient with respect to \(\alpha\):

\[\nabla_\alpha J(\alpha) = \mathbb{E}_{a_t \sim \pi}\big[-\log \pi(a_t|s_t)\big] - \bar{\mathcal{H}} = \mathcal{H}(\pi) - \bar{\mathcal{H}}\]

(In practice, implementations optimize \(\log \alpha\) to guarantee \(\alpha > 0\).)

The target entropy is typically set to \(\bar{\mathcal{H}} = -\dim(\mathcal{A})\) (the negative dimension of the action space).

Intuition:

- When the policy entropy is above \(\bar{\mathcal{H}}\) (too much exploration): \(\nabla_\alpha J > 0\), so gradient descent decreases \(\alpha\), weakening the entropy regularization
- When the policy entropy is below \(\bar{\mathcal{H}}\) (too little exploration): \(\nabla_\alpha J < 0\), so gradient descent increases \(\alpha\), strengthening the entropy regularization
- At the equilibrium, the gradient is zero exactly when the policy entropy equals \(\bar{\mathcal{H}}\)
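These gradient signs can be verified numerically with the same temperature loss used in the `SAC` class (for simplicity, a single deterministic `log_pi` value stands in for the expectation over sampled actions):

```python
import torch

# Numeric check of the gradient signs, using the alpha loss from SAC.update:
# L(log_alpha) = -log_alpha * (log_pi + target_entropy)
target_entropy = -2.0  # H_bar for a 2-D action space

def alpha_grad(entropy):
    log_alpha = torch.zeros(1, requires_grad=True)
    log_pi = torch.tensor([-entropy])  # E[log pi] = -H for this stand-in
    loss = -(log_alpha * (log_pi + target_entropy)).mean()
    loss.backward()
    return log_alpha.grad.item()

assert alpha_grad(entropy=-1.0) > 0  # entropy above H_bar: descent pushes alpha down
assert alpha_grad(entropy=-3.0) < 0  # entropy below H_bar: descent pushes alpha up
```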
## 4. Chapter Summary

### Core Concepts

```
SAC:
├── Maximum entropy framework: reward + α·entropy
├── Double Q-networks: reduce overestimation
├── Reparameterization: low-variance gradients
└── Automatic temperature: adaptive exploration

Advantages:
├── High sample efficiency (off-policy)
├── Stable training
└── Continuous action spaces
```
## ✅ Self-Test Questions

- How does maximum entropy RL differ from standard RL?
- Why use double Q-networks?
- How does automatic temperature tuning work?
## 📚 Further Reading

- Haarnoja et al. (2018), "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning"

→ Next: 03-TRPO算法.md