03 - Q-Learning算法：Off-Policy TD控制¶

学习时间: 3-4小时 重要性: ⭐⭐⭐⭐⭐ 最广泛使用的强化学习算法 前置知识: SARSA算法、TD学习

🎯 学习目标¶

完成本章后，你将能够： - 理解Q-Learning的核心思想和Off-Policy特性 - 掌握Q-Learning与SARSA的关键区别 - 实现Q-Learning算法解决控制问题 - 分析Q-Learning的收敛性 - 应用Q-Learning解决实际问题

1. Q-Learning简介¶

1.1 历史背景¶

Q-Learning由Chris Watkins于1989年提出，是强化学习领域最具影响力的算法之一。

核心突破：

无需知道环境模型，直接学习最优动作值函数

1.2 核心思想¶

Off-Policy学习： - 行为策略（Behavior Policy）：用于探索环境的策略（如ε-贪婪） - 目标策略（Target Policy）：想要学习的策略（贪心策略）

关键洞察：

即使使用随机探索，也可以学习最优策略的值函数

1.3 Q-Learning更新规则¶

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]\]

与SARSA的对比：

算法	更新目标	策略类型
SARSA	\(R + \gamma Q(s', a')\)	On-Policy
Q-Learning	\(R + \gamma \max_{a'} Q(s', a')\)	Off-Policy

关键区别： - SARSA使用实际执行的下一个动作a' - Q-Learning使用最优动作的Q值，不管实际执行什么

2. Q-Learning算法详解¶

2.1 算法流程¶

Text Only

初始化: Q(s,a) = 0, ∀s,a

对每个episode:
    初始化状态S

    当S不是终止状态时:
        使用ε-贪婪策略选择动作A
        执行动作A, 观察R, S'

        Q(S,A) ← Q(S,A) + α[R + γ max_a' Q(S',a') - Q(S,A)]

        S ← S'

2.2 代码实现¶

Python

import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """
    Q-Learning算法实现

    参数:
        env: 环境
        num_episodes: 训练的episode数量
        alpha: 学习率
        gamma: 折扣因子
        epsilon: ε-贪婪探索率

    返回:
        Q: 动作值函数
        policy: 学到的策略
        rewards_history: 每episode的总奖励
    """
    n_actions = env.get_action_space()
    Q = defaultdict(lambda: np.zeros(n_actions))  # defaultdict访问不存在的键时返回默认值
    rewards_history = []

    def epsilon_greedy(state, Q, epsilon):
        """ε-贪婪策略"""
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        else:
            return np.argmax(Q[state])

    for episode in range(num_episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            # 选择动作（ε-贪婪）
            action = epsilon_greedy(state, Q, epsilon)

            # 执行动作
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total_reward += reward

            # Q-Learning更新（Off-Policy）
            if not done:
                # 使用最大Q值，不管实际执行什么动作
                td_target = reward + gamma * np.max(Q[next_state])
            else:
                td_target = reward

            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            # 转移到下一个状态
            state = next_state

        rewards_history.append(total_reward)

        # 衰减探索率
        epsilon = max(0.01, epsilon * 0.995)

        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode + 1}/{num_episodes}, "
                  f"Avg Reward: {avg_reward:.2f}, Epsilon: {epsilon:.3f}")

    # 提取最终策略（贪心）
    policy = {}
    for state in Q.keys():
        policy[state] = np.argmax(Q[state])

    return Q, policy, rewards_history

2.3 Off-Policy的优势¶

为什么Off-Policy很有用？

数据重用：可以用旧策略收集的数据学习新策略
探索自由：可以大胆探索，不影响学习目标
学习最优策略：即使探索是随机的，也能学到最优策略

类比： - On-Policy = 边做边学，学什么做什么 - Off-Policy = 看别人做，学最优方法

3. Q-Learning的收敛性¶

3.1 收敛定理¶

定理（Q-Learning收敛性）：

在满足以下条件时，Q-Learning以概率1收敛到最优Q函数Q*：

所有状态-动作对被无限次访问
步长满足Robbins-Monro条件
行为策略是Greedy in the Limit（最终变得贪心）

注意：与SARSA不同，Q-Learning不需要GLIE策略！

3.2 为什么Q-Learning能收敛？¶

关键：贝尔曼最优算子的收缩性

\[\|TQ_1 - TQ_2\|_\infty \leq \gamma \|Q_1 - Q_2\|_\infty\]

Q-Learning是随机逼近这个不动点。

3.3 收敛速率¶

Q-Learning的收敛速率：\(O(1/t^{1/3})\)（比SARSA慢）

原因： - max操作引入了额外的非线性 - 需要更多样本来准确估计最大值

4. 实践应用¶

4.1 悬崖行走问题¶

Python

class CliffWalkingEnv:
    """悬崖行走环境"""

    def __init__(self, width=12, height=4):
        self.width = width
        self.height = height
        self.start = (3, 0)
        self.goal = (3, 11)

    def step(self, action):
        x, y = self.state

        # 动作: 0=上, 1=右, 2=下, 3=左
        if action == 0:
            x = max(0, x - 1)
        elif action == 1:
            y = min(self.width - 1, y + 1)
        elif action == 2:
            x = min(self.height - 1, x + 1)
        elif action == 3:
            y = max(0, y - 1)

        # 检查是否掉下悬崖
        if x == 3 and 1 <= y <= 10:
            reward = -100
            done = True
            self.state = self.start  # 回到起点
        elif (x, y) == self.goal:
            reward = -1
            done = True
            self.state = (x, y)
        else:
            reward = -1
            done = False
            self.state = (x, y)

        return self._get_state(), reward, done, False, {}

# 训练
env = CliffWalkingEnv()
Q, policy, rewards = q_learning(env, num_episodes=500, alpha=0.5, epsilon=0.1)

# 对比SARSA和Q-Learning
print("Q-Learning学到的策略（注意紧贴悬崖）:")
visualize_policy(policy, env)

4.2 学习曲线可视化¶

Python

import matplotlib.pyplot as plt

def plot_learning_curve(rewards, title="Q-Learning Training"):
    """绘制学习曲线"""
    plt.figure(figsize=(12, 5))

    # 原始奖励
    plt.subplot(1, 2, 1)
    plt.plot(rewards, alpha=0.3, label='Raw')

    # 移动平均
    window = 50
    if len(rewards) >= window:
        moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
        plt.plot(range(window-1, len(rewards)), moving_avg,
                label=f'MA({window})', linewidth=2)

    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)

    # 奖励分布
    plt.subplot(1, 2, 2)
    plt.hist(rewards, bins=30, edgecolor='black', alpha=0.7)
    plt.xlabel('Total Reward')
    plt.ylabel('Frequency')
    plt.title('Reward Distribution')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# 使用
plot_learning_curve(rewards, "Q-Learning on Cliff Walking")

5. Q-Learning的变体¶

5.1 Double Q-Learning¶

问题：Q-Learning的max操作导致过度估计（Overestimation）

解决方案：使用两个Q表

Python

def double_q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Double Q-Learning"""
    n_actions = env.get_action_space()
    Q1 = defaultdict(lambda: np.zeros(n_actions))
    Q2 = defaultdict(lambda: np.zeros(n_actions))

    def epsilon_greedy(state, Q1, Q2, epsilon):
        """基于平均Q值的ε-贪婪"""
        Q_avg = (Q1[state] + Q2[state]) / 2
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return np.argmax(Q_avg)

    for episode in range(num_episodes):
        state, _ = env.reset()
        done = False

        while not done:
            action = epsilon_greedy(state, Q1, Q2, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # 随机选择更新哪个Q表
            if np.random.random() < 0.5:
                # 更新Q1
                if not done:
                    best_action = np.argmax(Q1[next_state])
                    td_target = reward + gamma * Q2[next_state][best_action]
                else:
                    td_target = reward
                Q1[state][action] += alpha * (td_target - Q1[state][action])
            else:
                # 更新Q2
                if not done:
                    best_action = np.argmax(Q2[next_state])
                    td_target = reward + gamma * Q1[next_state][best_action]
                else:
                    td_target = reward
                Q2[state][action] += alpha * (td_target - Q2[state][action])

            state = next_state

    # 合并两个Q表
    Q = defaultdict(lambda: np.zeros(n_actions))
    for state in set(list(Q1.keys()) + list(Q2.keys())):
        Q[state] = (Q1[state] + Q2[state]) / 2

    return Q

5.2 延迟Q-Learning¶

思想：只在Q值变化足够大时才更新，减少计算。

6. 本章总结¶

核心概念¶

Text Only

Q-Learning:
├── Off-Policy: 行为策略 ≠ 目标策略
├── 更新: Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]
├── 使用最优动作的Q值
├── 更激进，追求最优
└── 可能过度估计（可用Double Q-Learning解决）

收敛性:
├── 不需要GLIE策略
├── 收敛到Q*
└── 收敛速率: O(1/t^{1/3})

学习路径¶

Text Only

下一步:
├── 探索与利用 - 更深入的探索策略
├── 多步方法 - n-step和TD(λ)
└── 函数近似 - 处理大规模问题

✅ 自测问题¶

为什么Q-Learning被称为"Off-Policy"算法？与SARSA有什么本质区别？
Q-Learning的max操作会导致什么问题？Double Q-Learning如何解决？
在悬崖行走问题中，为什么Q-Learning学到的策略比SARSA更危险？
Q-Learning的收敛条件与SARSA有什么不同？为什么？
设计一个实验验证Q-Learning的过度估计问题。

📚 延伸阅读¶

Watkins (1989)
"Learning from Delayed Rewards"
Q-Learning的原始博士论文
Watkins & Dayan (1992)
"Q-learning"
Q-Learning收敛性的经典证明
Hasselt (2010)
"Double Q-learning"
解决过度估计问题

掌握了Q-Learning，你已经学会了最重要的强化学习算法！

→ 下一步：04-探索与利用.md