04 - Model-Based Methods
Study time: 3-4 hours | Importance: ⭐⭐⭐⭐ | Combines model learning with planning | Prerequisites: Dyna-Q, model learning
🎯 Learning Objectives

After completing this chapter, you will be able to:
- Understand the difference between model-based and model-free methods
- Master the Dyna-Q algorithm
- Learn approaches for learning a dynamics model
- Understand advanced methods such as MBMF
- Apply model-based methods to improve sample efficiency
1. Introduction to Model-Based Methods

1.1 Core Idea

Learn a model of the environment's transition dynamics and rewards:

\[
\hat{P}(s'|s,a) \approx P(s'|s,a), \qquad \hat{R}(s,a) \approx R(s,a)
\]

Advantages:
- Higher sample efficiency
- Enables planning
- Better generalization

Challenges:
- Model errors compound over long rollouts
- Higher computational cost
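To make the idea concrete, in a small discrete environment the model can be estimated simply by counting observed transitions. The sketch below uses made-up (s, a, r, s') samples; all numbers are illustrative:

```python
import numpy as np
from collections import defaultdict

# Count-based tabular model estimate
counts = defaultdict(lambda: defaultdict(int))
rewards = defaultdict(list)

transitions = [  # hypothetical (s, a, r, s') samples from interaction
    (0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 0.0, 2),
    (1, 0, 1.0, 0), (1, 0, 1.0, 0),
]
for s, a, r, s2 in transitions:
    counts[(s, a)][s2] += 1
    rewards[(s, a)].append(r)

def P_hat(s2, s, a):
    """Empirical estimate of P(s'|s,a) from transition counts."""
    n = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / n

def R_hat(s, a):
    """Empirical estimate of R(s,a) as the mean observed reward."""
    return float(np.mean(rewards[(s, a)]))

print(P_hat(1, 0, 1))  # 2/3 ≈ 0.667
print(R_hat(1, 0))     # 1.0
```

With enough samples these empirical estimates converge to the true dynamics, which is exactly what Dyna-Q below exploits with its (deterministic) stored model.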
1.2 Taxonomy

```
Model-based methods
├── Learned model + planning
│   ├── Dyna-Q
│   ├── MBMF
│   └── MBPO
├── Learned model + policy optimization
│   ├── SVG
│   └── MAAC
└── Implicit model
    ├── MuZero
    └── Dreamer
```
2. Dyna-Q

2.1 Core Idea

Combine learning with planning:
1. Learn from real experience
2. Use the learned model to generate simulated experience
3. Continue learning from the simulated experience
2.2 Algorithm

```python
import numpy as np


class DynaQ:
    """Dyna-Q algorithm."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 epsilon=0.1, n_planning=10):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_planning = n_planning

        # Q-table
        self.Q = np.zeros((n_states, n_actions))
        # Model: (s, a) -> (r, s', done)
        self.model = {}
        self.visited = set()  # visited state-action pairs

    def select_action(self, state):
        """Epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.Q[state])

    def learn(self, state, action, reward, next_state, done):
        """Learn from a real transition."""
        # Q-Learning update
        if not done:
            td_target = reward + self.gamma * np.max(self.Q[next_state])
        else:
            td_target = reward
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

        # Update the model
        self.model[(state, action)] = (reward, next_state, done)
        self.visited.add((state, action))

    def planning(self):
        """Planning with simulated experience."""
        for _ in range(self.n_planning):
            if len(self.visited) == 0:
                break
            # Randomly sample a previously visited state-action pair
            state, action = list(self.visited)[np.random.randint(len(self.visited))]
            reward, next_state, done = self.model[(state, action)]

            # Q-Learning update on the model-generated transition
            if not done:
                td_target = reward + self.gamma * np.max(self.Q[next_state])
            else:
                td_target = reward
            td_error = td_target - self.Q[state, action]
            self.Q[state, action] += self.alpha * td_error

    def train(self, env, num_episodes=1000):
        """Training loop."""
        rewards_history = []

        for episode in range(num_episodes):
            state, _ = env.reset()
            total_reward = 0
            done = False

            while not done:
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated

                # Learn from the real transition
                self.learn(state, action, reward, next_state, done)
                # Planning updates from the learned model
                self.planning()

                total_reward += reward
                state = next_state

            rewards_history.append(total_reward)

            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(rewards_history[-100:])
                print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")

        return rewards_history
```
2.3 Dyna-Q vs Q-Learning

| Aspect | Q-Learning | Dyna-Q |
|---|---|---|
| Sample efficiency | Low | High |
| Computational cost | Low | Medium |
| Memory usage | Low | Medium |
| Suited for | Simple environments | Complex environments |
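The sample-efficiency claim can be checked empirically. The sketch below is a compact, self-contained tabular Dyna-Q (written independently of the `DynaQ` class above) on a made-up corridor environment: actions move left or right, and reward 1 is given on reaching the rightmost state. Setting `n_planning=0` recovers plain Q-Learning:

```python
import numpy as np

def run_dyna(n_planning, episodes=10, n_states=7, alpha=0.5, gamma=0.95,
             eps=0.2, seed=0):
    """Tabular Dyna-Q on a toy corridor; n_planning=0 is plain Q-Learning."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    model, visited = {}, []
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy with random tie-breaking
            if rng.random() < eps or Q[s, 0] == Q[s, 1]:
                a = int(rng.integers(2))
            else:
                a = int(np.argmax(Q[s]))
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            done = s2 == n_states - 1
            r = 1.0 if done else 0.0
            # Q-Learning update from the real transition
            target = r if done else r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            # store the transition in the (deterministic) model
            if (s, a) not in model:
                visited.append((s, a))
            model[(s, a)] = (r, s2, done)
            # planning: replay remembered transitions
            for _ in range(n_planning):
                ps, pa = visited[int(rng.integers(len(visited)))]
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * Q[ps2].max()
                Q[ps, pa] += alpha * (ptarget - Q[ps, pa])
            s = s2
    return Q

Q_qlearn = run_dyna(n_planning=0)   # no planning -> plain Q-Learning
Q_dyna = run_dyna(n_planning=20)    # 20 planning updates per real step
print(Q_qlearn[0].max(), Q_dyna[0].max())
```

With the same number of real episodes, the planning variant propagates the goal reward back to the start state much faster, which is exactly the "high sample efficiency" entry in the table.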
3. Model Learning with Neural Networks

3.1 Deterministic Model

```python
import torch
import torch.nn as nn
import torch.optim as optim


class DeterministicModel(nn.Module):  # subclass nn.Module to define the network
    """Deterministic transition model."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(DeterministicModel, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)  # concatenate along the feature dim
        next_state_delta = self.net(x)  # predict the state change, not the raw state
        return state + next_state_delta


class ModelLearner:
    """Model learner."""

    def __init__(self, state_dim, action_dim, lr=1e-3):
        self.model = DeterministicModel(state_dim, action_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()

    def train_step(self, states, actions, next_states):
        """One training step for the model."""
        predicted_next = self.model(states, actions)
        loss = self.loss_fn(predicted_next, next_states)

        self.optimizer.zero_grad()  # clear accumulated gradients
        loss.backward()             # backpropagate
        self.optimizer.step()       # update parameters

        return loss.item()  # convert the scalar tensor to a Python float

    def predict(self, state, action):
        """Predict the next state."""
        with torch.no_grad():  # disable gradient tracking for inference
            state_tensor = torch.FloatTensor(state).unsqueeze(0)   # add a batch dim
            action_tensor = torch.FloatTensor(action).unsqueeze(0)
            next_state = self.model(state_tensor, action_tensor)
        return next_state.squeeze(0).numpy()  # drop the batch dim
```
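The residual trick above (predicting the state *delta* rather than the raw next state) can be sanity-checked without a neural network. The sketch below fits a linear model by least squares on made-up linear dynamics; the matrices `A`, `B` and all numbers are illustrative assumptions, not from the original text:

```python
import numpy as np

# Assumed ground-truth linear dynamics: s' = A s + B a (+ small noise)
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

# Collect synthetic transitions
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A.T + U @ B.T + 0.01 * rng.normal(size=(500, 2))

# Fit by least squares on the state delta, mirroring DeterministicModel
X = np.hstack([S, U])                               # model input: (s, a)
W, *_ = np.linalg.lstsq(X, S_next - S, rcond=None)  # delta = X @ W

def predict(s, a):
    """One-step prediction with the learned linear model."""
    x = np.concatenate([s, a])
    return s + x @ W

s, a = np.array([1.0, 0.5]), np.array([0.2])
err = np.abs(predict(s, a) - (A @ s + B @ a)).max()
print(predict(s, a), err)
```

Predicting the delta keeps the target roughly zero-centered and makes the identity map trivial to represent, which tends to stabilize model learning.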
3.2 Probabilistic Model

```python
class ProbabilisticModel(nn.Module):
    """Probabilistic (Gaussian) transition model."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super(ProbabilisticModel, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean = nn.Linear(hidden_dim, state_dim)
        self.log_std = nn.Linear(hidden_dim, state_dim)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        features = self.net(x)
        mean = self.mean(features)
        log_std = self.log_std(features)
        log_std = torch.clamp(log_std, -5, 2)  # keep the std in a sane range
        return mean, log_std

    def sample(self, state, action):
        """Sample a next state (reparameterized)."""
        mean, log_std = self.forward(state, action)
        std = torch.exp(log_std)
        noise = torch.randn_like(mean)
        next_state = mean + std * noise
        return next_state
```
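The class above defines the model but not its training objective. A Gaussian model is usually trained by minimizing the negative log-likelihood (NLL) of observed next states, which penalizes both inaccurate means and miscalibrated variances. A minimal numpy sketch of that loss (in PyTorch one would typically use `torch.distributions.Normal(...).log_prob` instead):

```python
import numpy as np

def gaussian_nll(mean, log_std, target):
    """Negative log-likelihood of target under N(mean, exp(log_std)^2),
    summed over state dimensions and averaged over the batch."""
    var = np.exp(2 * log_std)
    nll = 0.5 * ((target - mean) ** 2 / var + 2 * log_std + np.log(2 * np.pi))
    return nll.sum(axis=-1).mean()

# A prediction whose mean matches the target scores better (lower NLL)
mean = np.zeros((4, 3))
log_std = np.zeros((4, 3))   # std = 1
good = gaussian_nll(mean, log_std, np.zeros((4, 3)))
bad = gaussian_nll(mean, log_std, np.ones((4, 3)) * 3)
print(good, bad)
```

Because the variance is learned, the model can report high uncertainty in poorly explored regions instead of being forced to commit to a single prediction.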
4. MBMF: Model-Based Value Expansion

4.1 Core Idea

Short-horizon model + long-horizon value function:
- Use the learned model for an H-step rollout
- Estimate the remaining return with a value function
\[Q(s,a) = \sum_{t=0}^{H-1} \gamma^t r_t + \gamma^H V(s_H)\]
4.2 Implementation

```python
class MBMF:
    """Model-Based Value Expansion."""

    def __init__(self, model, q_network, policy, horizon=5, gamma=0.99):
        self.model = model          # learned dynamics model exposing step(s, a) -> (s', r)
        self.q_network = q_network
        self.policy = policy        # current policy, used to pick rollout actions
        self.horizon = horizon
        self.gamma = gamma

    def expand_value(self, state, action):
        """
        Expanded value estimate:
        simulate H steps with the model, then bootstrap with the Q-network.
        """
        total_reward = 0
        current_state = state

        # H-step model rollout
        for t in range(self.horizon):
            # predict with the learned model
            next_state, reward = self.model.step(current_state, action)
            total_reward += (self.gamma ** t) * reward

            # choose the next action with the current policy
            action = self.policy(next_state)
            current_state = next_state

        # estimate the remaining value with the Q-network
        final_q = self.q_network(current_state, action)
        total_reward += (self.gamma ** self.horizon) * final_q

        return total_reward
```
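The expansion formula can be verified with a small worked example; the per-step rewards and the terminal value below are made up for illustration. It also shows why rolling out helps: any bias in the bootstrapped value function enters the estimate scaled by gamma^H:

```python
gamma = 0.9

# Hypothetical H-step rollout: model rewards plus a terminal value estimate
rewards = [1.0, 0.5, 0.25]   # r_0, r_1, r_2
v_terminal = 2.0             # V(s_H) from the value/Q network
H = len(rewards)

# Q(s,a) = sum_t gamma^t r_t + gamma^H V(s_H)
q = sum(gamma ** t * r for t, r in enumerate(rewards)) + gamma ** H * v_terminal
print(q)  # ≈ 3.1105

# A biased value estimate contributes bias * gamma^H, shrinking with horizon
bias = 1.0
for h in [0, 1, 3, 5]:
    print(h, round(gamma ** h * bias, 4))
```

The trade-off is that longer rollouts accumulate more model error, so H balances value-function bias against model bias.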
5. Summary of Model-Based Methods

Method Comparison

| Method | Model type | Planning style | Typical setting |
|---|---|---|---|
| Dyna-Q | Tabular | Random replay of visited pairs | Discrete environments |
| MBMF | Neural network | Short-horizon rollout | Continuous environments |
| MBPO | Neural network | Branched rollouts | Complex environments |
| MuZero | Implicit | MCTS | Complex tasks |
Core Concepts

```
Model-based methods:
├── Learn a model: P(s'|s,a), R(s,a)
├── Plan: generate simulated experience with the model
└── Combine: real experience + simulated experience

Advantages:
├── High sample efficiency
├── Enables planning
└── Better generalization

Challenges:
├── Model error
└── Computational cost
```
✅ Self-Check Questions

- What are the advantages and disadvantages of model-based methods compared with model-free methods?
- What does the `n_planning` parameter in Dyna-Q control?
- How can model error be mitigated?
📚 Further Reading

- Sutton (1990) - Dyna-Q
- Feinberg et al. (2018) - MBMF
- Janner et al. (2019) - MBPO

→ Next: 05-分布式RL.md