Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
RL【8】:Value Function Approximation
RL【9】:Policy Gradient
RL【10-1】:Actor - Critic
RL【10-2】:Actor - Critic
Preface
This series records my study notes for Prof. Shiyu Zhao's Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the course itself, see:
Bilibili video: 【【强化学习的数学原理】课程:从零开始到透彻理解(完结)】
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Introduction
Actor-critic methods are still policy gradient methods.
- They emphasize the structure that incorporates the policy gradient and value-based methods.
What are “actor” and “critic”?
- Here, “actor” refers to the policy update. It is called the actor because the policy is applied to take actions.
- Here, “critic” refers to policy evaluation or value estimation. It is called the critic because it criticizes the policy by evaluating it.
The roles and functions of the actor and the critic can be understood as follows.
Actor
Function:
- The actor is responsible for decision making: given the current state $s$, it outputs a probability distribution over actions $a$ (the policy).
Mathematical form:
Usually a parameterized policy $\pi_\theta(a|s)$, where the parameters $\theta$ come from a neural network.
Intuitive analogy:
Like an actor on stage: after observing the current state of the environment, it decides which action to perform next.
Output:
- Discrete action space → a probability distribution over the actions.
- Continuous action space → the mean and variance of the action distribution (see the sketch below).
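To make the two output types concrete, here is a minimal PyTorch sketch of the two kinds of actor heads; the network sizes and layer choices are illustrative assumptions, not part of the course material.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscreteActor(nn.Module):
    """Discrete action space: state -> categorical distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, state):
        return Categorical(logits=self.net(state))   # π_θ(a|s)

class GaussianActor(nn.Module):
    """Continuous action space: state -> Gaussian with learned mean and std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state):
        return Normal(self.mu(state), self.log_std.exp())
```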
Critic
Function:
The critic evaluates how good the actor's action choices are. It estimates a value function to measure the long-term return of a state or state-action pair.
Mathematical form:
The state-value function $V^\pi(s)$
or the action-value function $Q^\pi(s,a)$.
The critic provides a gradient signal to the actor by comparing actual returns with its predicted values.
Intuitive analogy:
Like a critic: it does not perform, but comments on whether the last action was good or bad and indicates how to improve.
The actor-critic interaction loop
- Actor decides: select action $a_t$ according to state $s_t$.
- Environment feedback: the environment returns the reward $r_t$ and the next state $s_{t+1}$.
- Critic evaluates: measure how good the action was via the TD (temporal-difference) error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
- Actor updates: use the critic's signal ($\delta_t$) to update the policy parameters $\theta$ (a tabular sketch of one pass through this loop follows).
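A minimal tabular sketch of one pass through this loop, assuming a softmax policy over action preferences and a hand-picked placeholder transition (the state/action sizes, step sizes, and the sampled reward are illustrative, not from the course):

```python
import numpy as np

num_states, num_actions = 5, 3
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.01
V = np.zeros(num_states)                      # critic: tabular state values
theta = np.zeros((num_states, num_actions))   # actor: softmax preferences

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
a = np.random.choice(num_actions, p=policy(s))      # 1. actor decides
r, s_next = 1.0, 2                                  # 2. environment feedback (placeholder)
delta = r + gamma * V[s_next] - V[s]                # 3. critic: TD error δ_t
V[s] += alpha_v * delta                             #    critic update
grad_log_pi = -policy(s); grad_log_pi[a] += 1.0     # ∇_θ ln π(a|s) for a softmax policy
theta[s] += alpha_pi * delta * grad_log_pi          # 4. actor update driven by δ_t
```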
The simplest actor-critic (QAC)
Revisit the idea of policy gradient
A scalar metric $J(\theta)$, which can be $\bar v_\pi$ or $\bar r_\pi$.
The gradient-ascent algorithm maximizing $J(\theta)$ is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big]$$
The stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t,a_t)$$
We can see “actor” and “critic” from this algorithm:
- This update corresponds to the actor!
- The algorithm that estimates $q_t(s,a)$ corresponds to the critic!
From policy gradient to actor-critic
In policy gradient (PG) methods, the goal is to maximize some metric (such as $\bar v_\pi$ or $\bar r_\pi$):
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big]$$
- Actor part: $\nabla_\theta \ln \pi(A|S,\theta)$, which determines how the policy parameters $\theta$ are updated;
- Critic part: $q_\pi(S,A)$, which determines the learning signal given to the actor (the critic evaluates the value of the current action and feeds that evaluation back to the actor).
However:
- $q_\pi(s,a)$ is unknown in a real environment → it must be estimated.
- Estimating it with Monte Carlo methods gives REINFORCE;
- estimating it with function approximation plus temporal-difference (TD) learning gives actor-critic (the two estimators are contrasted in the sketch below).
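To make the distinction concrete, a small sketch of the two ways to estimate $q_\pi(s_t,a_t)$ for the policy-gradient update (both function names and signatures are illustrative):

```python
def mc_estimate(rewards_from_t, gamma=0.99):
    """REINFORCE-style Monte Carlo estimate: discounted return of the rest of the episode."""
    g, discount = 0.0, 1.0
    for r in rewards_from_t:
        g += discount * r
        discount *= gamma
    return g

def td_estimate(r_next, q_next, gamma=0.99):
    """Actor-critic-style estimate: bootstrap on the critic's estimate of the next step."""
    return r_next + gamma * q_next
```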
The simplest actor-critic algorithm (QAC)
- Aim: search for an optimal policy by maximizing $J(\theta)$.
- At time step $t$ in each episode, do:
Generate $a_t$ following $\pi(a|s_t,\theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a|s_{t+1}, \theta_t)$.
Critic (value update):
$$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$
Actor (policy update):
$$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$
The actor-critic framework (QAC)
In Q actor-critic (QAC):
Actor (policy updater):
Updates the policy parameters $\theta$ using the $q(s,a)$ provided by the critic:
$$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$
→ This is a policy-gradient update that raises the probability of high-value actions.
Critic (value estimator):
Updates the parameters $w$ of $q(s,a)$ with a TD method:
$$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1},a_{t+1},w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$
→ This is value function approximation, which corrects the estimate of $q(s,a)$.
Remarks:
- The critic corresponds to “SARSA + value function approximation”.
- The actor corresponds to the policy update algorithm.
- The algorithm is on-policy (why is PG on-policy?).
- Since the policy is stochastic, no need to use techniques like $\varepsilon$-greedy.
- This particular actor-critic algorithm is sometimes referred to as Q Actor-Critic (QAC).
- Though simple, this algorithm reveals the core idea of actor-critic methods.
Notes on the remarks
- Division of labor between actor and critic
- Actor: learns the policy $\pi(a|s,\theta)$ (the policy-gradient update);
- Critic: learns the value function $q(s,a,w)$ (SARSA + function approximation).
- On-policy property
- The sampled data must be generated by the current policy $\pi$.
- This is because $\nabla_\theta \ln \pi(a|s,\theta)$ depends directly on the current policy.
- There is no need for $\varepsilon$-greedy exploration as in Q-learning.
- Why is it called QAC?
- Because the critic uses the action-value function $Q(s,a)$, hence Q actor-critic.
- Significance
- REINFORCE uses MC → high variance;
- actor-critic uses TD → lower variance and more stability.
- QAC is the simplest actor-critic algorithm, but it already reveals the core idea: the actor adjusts the policy while the critic provides the signal (a minimal sketch follows below).
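A minimal PyTorch sketch of one QAC update step, assuming a discrete action space, an actor network that maps states to action logits, and a Q-network critic; the shapes, optimizers, and step sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class QCritic(nn.Module):
    """q(s, ., w): maps a state to one value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, s):
        return self.net(s)

def qac_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, a_next, gamma=0.99):
    a, a_next = torch.as_tensor(a), torch.as_tensor(a_next)

    # Critic (value update): SARSA-style TD update of q(s_t, a_t, w)
    q_sa = critic(s)[a]
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)[a_next]
    critic_loss = (td_target - q_sa) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy update): gradient ascent on ln π(a_t|s_t, θ) · q(s_t, a_t, w_{t+1})
    with torch.no_grad():
        q_coeff = critic(s)[a]                       # treated as a constant coefficient
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q_coeff)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Here `actor` can be any module that maps a state to action logits, e.g. `nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))`, with `torch.optim.Adam` optimizers for both networks.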
Advantage actor-critic (A2C)
Baseline invariance
Property: the policy gradient is invariant to an additional baseline
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big] = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$
- Here, the additional baseline $b(S)$ is a scalar function of $S$.
- Next, we answer two questions:
- Why is it valid?
- Why is it useful?
The core idea of baseline invariance
In policy gradient, the update is based on:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A) \Big]$$
This means the update direction of the policy parameters is determined by the value of the action in the current state, $q_\pi(S,A)$.
However, we can introduce a baseline $b(S)$ into the formula:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S,A} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$
Key conclusion: whatever $b(S)$ we choose, the expression is unchanged, i.e., the baseline does not change the expected gradient.
First, why is it valid?
That is because
$$\mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] = 0$$
The details:
$$\begin{aligned} \mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t)\, \nabla_\theta \ln \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s)\, \nabla_\theta 1 \\ &= 0 \end{aligned}$$
Why it is valid
We have shown that
$$\mathbb{E}_{S,A} \Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big] = 0$$
Intuition:
- the baseline is just a reference value that does not depend on the action;
- in expectation, its contribution to the gradient cancels out completely;
- so the baseline introduces no bias (the estimator remains unbiased); a numerical check is sketched below.
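A quick numerical sanity check of this fact (a sketch with an arbitrary softmax policy over three actions in a single state; $\theta$ and $b(s)$ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                   # preferences for one state
pi = np.exp(theta) / np.exp(theta).sum()     # softmax policy π(a|s)
b = 5.0                                      # any baseline value b(s)

grad_log_pi = np.eye(3) - pi                 # row a is ∇_θ ln π(a|s) for a softmax policy

expectation = (pi[:, None] * grad_log_pi * b).sum(axis=0)
print(expectation)                           # ≈ [0, 0, 0]: the baseline term vanishes in expectation
```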
Second, why is the baseline useful?
The gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}[X]$$
where
$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t) \big[q_\pi(S,A) - b(S)\big]$$
We have
- $\mathbb{E}[X]$ is invariant to $b(S)$.
- $\mathrm{var}(X)$ is NOT invariant to $b(S)$.
Why? Because
$$\mathrm{tr}[\mathrm{var}(X)] = \mathbb{E}[X^T X] - \bar{x}^T \bar{x}$$
and
$$\mathbb{E}[X^T X] = \mathbb{E}\Big[ (\nabla_\theta \ln \pi)^T (\nabla_\theta \ln \pi)\, (q_\pi(S,A) - b(S))^2 \Big] = \mathbb{E}\Big[ \|\nabla_\theta \ln \pi\|^2 (q_\pi(S,A) - b(S))^2 \Big]$$
Why it is useful
Although the baseline does not change the expected gradient, it does affect the variance of the gradient estimate:
the actual update uses a sample-based approximation,
$$\nabla_\theta J \approx \nabla_\theta \ln \pi(a|s,\theta)\, \big(q_\pi(s,a) - b(s)\big)$$
Without a baseline, the variance can be large (because $q_\pi(s,a)$ fluctuates a lot);
with a suitable baseline, the variance is reduced significantly and the updates become more stable.
This is the idea of variance reduction, illustrated numerically below.
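A Monte Carlo sketch of this effect on a toy single-state problem (the action values, policy, and sample size are arbitrary placeholders): the sample mean of the gradient estimate is essentially the same with or without the baseline, while the trace of its variance shrinks when $b(s) = v_\pi(s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
pi = np.exp(theta) / np.exp(theta).sum()     # π(a|s) for a single state
q = np.array([10.0, 12.0, 8.0])              # q_π(s, a), placeholder values
v = (pi * q).sum()                           # v_π(s), the suboptimal baseline

def sample_X(baseline, n=100_000):
    actions = rng.choice(3, size=n, p=pi)
    grad_log_pi = np.eye(3)[actions] - pi    # ∇_θ ln π(a|s) per sample
    return grad_log_pi * (q[actions] - baseline)[:, None]

for name, b in [("b = 0", 0.0), ("b = v_pi(s)", v)]:
    X = sample_X(b)
    print(name, "mean:", X.mean(axis=0).round(3),
          " tr[var]:", X.var(axis=0).sum().round(3))
```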
Our goal
- Select an optimal baseline $b$ to minimize $\mathrm{var}(X)$.
- Benefit: when we use a random sample to approximate $\mathbb{E}[X]$, the estimation variance would also be small.
- In the algorithms of REINFORCE and QAC,
- There is no baseline.
- Or, we can say $b = 0$, which is not guaranteed to be a good baseline.
The optimal baseline
The optimal baseline that can minimize $\mathrm{var}(X)$ is, for any $s \in \mathcal{S}$,
$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2 \big]}$$
Although this baseline is optimal, it is complex.
We can remove the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ and select the suboptimal baseline:
$$b(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s)$$
- which is the state value of $s$.
The optimal and suboptimal baselines
Optimal baseline:
$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2 \big]}$$
Theoretically optimal, but complicated to compute.
Suboptimal baseline:
$$b(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s)$$
That is, the state-value function.
This is the key to A2C: use $A(s,a) = q_\pi(s,a) - v_\pi(s)$ as the advantage function.
Connection to A2C (advantage actor-critic)
Actor part:
Update the policy using the advantage function $A(s,a) = q_\pi(s,a) - v_\pi(s)$:
$$\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a|s,\theta)\, A(s,a)$$
Critic part:
Learn the value function $v_\pi(s)$ to serve as the baseline $b(s)$.
Intuition:
- The critic estimates the baseline (the state value $v_\pi(s)$);
- the actor updates with $q_\pi - v_\pi$, so actions better than average are reinforced and actions worse than average are weakened (see the numeric example below);
- the benefit is lower gradient variance and more stable updates.
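A tiny numeric illustration with made-up numbers:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])   # π(a|s)
q = np.array([3.0, 1.0, 2.0])    # q_π(s, a)
v = (pi * q).sum()               # v_π(s) = 1.7
advantage = q - v                # [ 1.3, -0.7,  0.3]
# A(s,a) > 0: better than average -> its probability is pushed up;
# A(s,a) < 0: worse than average -> its probability is pushed down.
```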
The algorithm of advantage actor-critic
When $b(s) = v_\pi(s)$,
the gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\big[q_\pi(S,A) - v_\pi(S)\big]\Big] \doteq \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\, \delta_\pi(S,A)\Big]$$
where
$$\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$$
is called the advantage function (why called advantage?).
The stochastic version of this algorithm is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\big[q_t(s_t, a_t) - v_t(s_t)\big] = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t,a_t)$$
Moreover, the algorithm can be reexpressed as
$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t,a_t) \\ &= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)}\, \delta_t(s_t,a_t) \\ &= \theta_t + \alpha \left(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t, \theta_t)}\right) \nabla_\theta \pi(a_t|s_t, \theta_t) \end{aligned}$$
- The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$, which is more reasonable.
- It can still well balance exploration and exploitation.
Furthermore, the advantage function is approximated by the TD error:
$$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\to\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$
This approximation is reasonable because
$$\mathbb{E}\big[q_\pi(S,A) - v_\pi(S) \mid S=s_t, A=a_t\big] = \mathbb{E}\big[R + \gamma v_\pi(S') - v_\pi(S) \mid S=s_t, A=a_t\big]$$
Benefit: only one network is needed to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.
Advantage actor-critic (A2C) or TD actor-critic
Aim: Search for an optimal policy by maximizing $J(\theta)$.
At time step $t$ in each episode, do
Generate $a_t$ following $\pi(a|s_t, \theta_t)$ and then observe $r_{t+1}, s_{t+1}$.
TD error (advantage function):
$$\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$$
Critic (value update):
$$w_{t+1} = w_t + \alpha_w\, \delta_t \nabla_w v(s_t, w_t)$$
Actor (policy update):
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t \nabla_\theta \ln \pi(a_t|s_t, \theta_t)$$
It is on-policy. Since the policy $\pi(\theta_t)$ is stochastic, no need to use techniques like $\varepsilon$-greedy.
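A minimal PyTorch sketch of one A2C update step, assuming a discrete action space, an actor network mapping states to logits, and a critic mapping states to a scalar $v(s, w)$; the shapes, optimizers, and step sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

def a2c_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    # TD error as the advantage estimate: δ_t = r_{t+1} + γ v(s_{t+1}, w) - v(s_t, w)
    v_s = critic(s).squeeze()
    with torch.no_grad():
        td_target = r + gamma * critic(s_next).squeeze()
    delta = (td_target - v_s).detach()           # δ_t, used as a constant coefficient

    # Critic (value update): move v(s_t, w) toward the TD target
    critic_loss = (td_target - v_s) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy update): gradient ascent on δ_t ln π(a_t|s_t, θ)
    log_prob = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -(delta * log_prob)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return float(delta)
```

Only a single state-value network is needed here, matching the benefit noted above; `critic` maps a state to one scalar, e.g. `nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))`.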
Baseline → advantage function → A2C implementation
1. From baseline to advantage function
In policy gradient we have the basic update
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$$
However, using $q_\pi(s,a)$ directly tends to produce high variance, so we introduce a baseline $b(s)$ to reduce it:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\big[q_\pi(s_t,a_t) - b(s_t)\big]$$
A common choice is the state-value function $b(s) = v_\pi(s)$. The update then becomes
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, \delta_\pi(s_t,a_t)$$
where
$$\delta_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$$
is called the advantage function.
Intuition:
- $q_\pi(s,a)$ is the long-term value of taking action $a$ in state $s$.
- $v_\pi(s)$ is the average value of state $s$ (weighted over all actions).
- Therefore $\delta_\pi(s,a)$ measures how good this action is relative to the average.
- $\delta > 0$ → the action is better than average, so its probability should be increased.
- $\delta < 0$ → the action is worse than average, so its probability should be decreased.
2. Further rewriting of the update
Using the likelihood-ratio form, the update can be written as
$$\theta_{t+1} = \theta_t + \alpha \left(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\right) \nabla_\theta \pi(a_t|s_t,\theta_t)$$
This shows that the step size is tied to the relative magnitude of the advantage, which gives a more reasonable balance between exploration and exploitation.
3. Approximating the advantage: the TD error
Computing $q_\pi(s,a)$ directly is too expensive, so the temporal-difference (TD) error is used as an approximation:
$$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\approx\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$
- Benefits:
- Only one value network $v_\pi(s)$ needs to be learned, instead of both $q_\pi(s,a)$ and $v_\pi(s)$, which reduces the computational cost.
- $\delta_t$ is both the TD error and an approximation of the advantage function.
4. The A2C algorithm flow
A2C combines the actor (policy update) and the critic (value update):
Critic (learns $v(s)$ and provides the learning signal):
$$w_{t+1} = w_t + \alpha_w\, \delta_t \nabla_w v(s_t, w_t)$$
- The critic uses the TD error $\delta_t$ to update $v(s)$.
Actor (updates the policy):
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$$
- The actor adjusts the policy according to the signal $\delta_t$ provided by the critic.
Intuition:
- The critic judges how good the action actually was and computes $\delta_t$.
- The actor, based on the critic's feedback, increases the probability of good actions and decreases the probability of bad ones.
Why does A2C help?
- Lower variance: the baseline $v(s)$ effectively reduces the randomness of the updates.
- More informative signal: the advantage tells us how an action compares to the average, rather than its absolute value.
- More efficient: the TD error approximates $q(s,a) - v(s)$, so only one critic network is needed.
- Still on-policy: sampling and updating are done under the current policy, with no extra exploration mechanism (e.g., $\epsilon$-greedy) required.
In summary, the key logic of A2C is:
policy gradient + baseline → advantage function → approximate the advantage with the TD error → actor and critic update jointly (a full training-loop sketch follows below).
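Putting the pieces together, a sketch of the full on-policy training loop, reusing the `a2c_step` function from the sketch above and assuming a Gymnasium-style `CartPole-v1` environment (the environment, network sizes, and learning rates are assumptions for illustration):

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")                 # 4-dimensional state, 2 actions
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # state -> logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state -> v(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)
        a = Categorical(logits=actor(s)).sample().item()   # on-policy sampling, no ε-greedy
        obs, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        s_next = torch.as_tensor(obs, dtype=torch.float32)
        # do not bootstrap past a terminal state
        a2c_step(actor, critic, actor_opt, critic_opt,
                 s, a, float(r), s_next, gamma=0.0 if terminated else 0.99)
```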
Summary
Actor-critic methods use the critic's value estimates to guide the actor's policy updates. In deterministic policy gradient methods (DPG/DDPG), by contrast, the actor outputs an action directly and is updated with $\nabla_\theta \mu(s)\nabla_a q(s,a)$; this avoids sampling from a probability distribution and is efficient for continuous action spaces, but exploration has to rely on extra added noise.