Series Contents
Fundamental Tools
RL【1】:Basic Concepts
RL【2】:Bellman Equation
RL【3】:Bellman Optimality Equation
Algorithm
RL【4】:Value Iteration and Policy Iteration
RL【5】:Monte Carlo Learning
RL【6】:Stochastic Approximation and Stochastic Gradient Descent
Method
RL【7-1】:Temporal-difference Learning
RL【7-2】:Temporal-difference Learning
RL【8】:Value Function Approximation
RL【9】:Policy Gradient
RL【10-1】:Actor - Critic
RL【10-2】:Actor - Critic
Preface
This series records my study notes for Prof. Shiyu Zhao's Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the course itself, see:
Bilibili video: 【【强化学习的数学原理】课程:从零开始到透彻理解(完结)】
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Introduction
Actor-critic methods are still policy gradient methods.
- They emphasize the structure that incorporates the policy gradient and value-based methods.
What are “actor” and “critic”?
- Here, “actor” refers to the policy update. It is called the actor because the policy is applied to take actions.
- Here, “critic” refers to policy evaluation or value estimation. It is called the critic because it criticizes the policy by evaluating it.
The roles and functions of the actor and the critic can be understood as follows.
Actor
Function:
- The actor is responsible for decision making: given the current state $s$, it outputs a probability distribution over actions $a$ (the policy).
Mathematical form:
Usually a parameterized policy $\pi_\theta(a|s)$, where the parameters $\theta$ come from a neural network.
Intuitive analogy:
Like an actor on stage: after observing the current state of the environment, it decides which action to perform next.
Output:
- Discrete action space → a probability distribution over the actions.
- Continuous action space → the mean and variance of the action distribution (see the sketch below).
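To make the two output types concrete, here is a minimal PyTorch sketch of the two kinds of actor heads; the network sizes and layer choices are illustrative assumptions, not part of the course material.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscreteActor(nn.Module):
    """Discrete action space: state -> categorical distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, state):
        return Categorical(logits=self.net(state))   # π_θ(a|s)

class GaussianActor(nn.Module):
    """Continuous action space: state -> Gaussian with learned mean and std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state):
        return Normal(self.mu(state), self.log_std.exp())
```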
Critic
Function:
The critic evaluates how good the actor's action choices are. It estimates a value function to measure the long-term return of a state or state-action pair.
Mathematical form:
The state-value function $V^\pi(s)$
or the action-value function $Q^\pi(s,a)$.
The critic provides a gradient signal to the actor by comparing actual returns with its predicted values.
Intuitive analogy:
Like a critic: it does not perform, but comments on whether the last action was good or bad and indicates how to improve.
The actor-critic interaction loop
- Actor decides: select action $a_t$ according to state $s_t$.
- Environment feedback: the environment returns the reward $r_t$ and the next state $s_{t+1}$.
- Critic evaluates: measure how good the action was via the TD (temporal-difference) error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
- Actor updates: use the critic's signal ($\delta_t$) to update the policy parameters $\theta$ (a tabular sketch of one pass through this loop follows).
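A minimal tabular sketch of one pass through this loop, assuming a softmax policy over action preferences and a hand-picked placeholder transition (the state/action sizes, step sizes, and the sampled reward are illustrative, not from the course):

```python
import numpy as np

num_states, num_actions = 5, 3
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.01
V = np.zeros(num_states)                      # critic: tabular state values
theta = np.zeros((num_states, num_actions))   # actor: softmax preferences

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
a = np.random.choice(num_actions, p=policy(s))      # 1. actor decides
r, s_next = 1.0, 2                                  # 2. environment feedback (placeholder)
delta = r + gamma * V[s_next] - V[s]                # 3. critic: TD error δ_t
V[s] += alpha_v * delta                             #    critic update
grad_log_pi = -policy(s); grad_log_pi[a] += 1.0     # ∇_θ ln π(a|s) for a softmax policy
theta[s] += alpha_pi * delta * grad_log_pi          # 4. actor update driven by δ_t
```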
The simplest actor-critic (QAC)
Revisit the idea of policy gradient
A scalar metric $J(\theta)$, which can be $\bar v_\pi$ or $\bar r_\pi$.
The gradient-ascent algorithm maximizing $J(\theta)$ is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big]$$
The stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, q_t(s_t,a_t)$$
We can see “actor” and “critic” from this algorithm:
- This update corresponds to the actor!
- The algorithm that estimates $q_t(s,a)$ corresponds to the critic!
From policy gradient to actor-critic
In policy gradient (PG) methods, the goal is to maximize some metric (such as $\bar v_\pi$ or $\bar r_\pi$):
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S \sim \eta,\, A \sim \pi}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big]$$
- Actor part: $\nabla_\theta \ln \pi(A|S,\theta)$, which determines how the policy parameters $\theta$ are updated;
- Critic part: $q_\pi(S,A)$, which determines the learning signal given to the actor (the critic evaluates the value of the current action and feeds that evaluation back to the actor).
However:
- $q_\pi(s,a)$ is unknown in a real environment → it must be estimated.
- Estimating it with Monte Carlo methods gives REINFORCE;
- estimating it with function approximation plus temporal-difference (TD) learning gives actor-critic (the two estimators are contrasted in the sketch below).
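To make the distinction concrete, a small sketch of the two ways to estimate $q_\pi(s_t,a_t)$ for the policy-gradient update (both function names and signatures are illustrative):

```python
def mc_estimate(rewards_from_t, gamma=0.99):
    """REINFORCE-style Monte Carlo estimate: discounted return of the rest of the episode."""
    g, discount = 0.0, 1.0
    for r in rewards_from_t:
        g += discount * r
        discount *= gamma
    return g

def td_estimate(r_next, q_next, gamma=0.99):
    """Actor-critic-style estimate: bootstrap on the critic's estimate of the next step."""
    return r_next + gamma * q_next
```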
The simplest actor-critic algorithm (QAC)
- Aim: search for an optimal policy by maximizing $J(\theta)$.
- At time step $t$ in each episode, do:
Generate $a_t$ following $\pi(a|s_t,\theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a|s_{t+1}, \theta_t)$.
Critic (value update):
$$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$
Actor (policy update):
$$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$
The actor-critic framework (QAC)
In Q actor-critic (QAC):
Actor (policy updater):
Updates the policy parameters $\theta$ using the $q(s,a)$ provided by the critic:
$$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q(s_t,a_t,w_{t+1})$$
→ This is a policy-gradient update that raises the probability of high-value actions.
Critic (value estimator):
Updates the parameters $w$ of $q(s,a)$ with a TD method:
$$w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1},a_{t+1},w_t) - q(s_t,a_t,w_t) \big] \nabla_w q(s_t,a_t,w_t)$$
→ This is value function approximation, which corrects the estimate of $q(s,a)$.
Remarks:
- The critic corresponds to “SARSA + value function approximation”.
- The actor corresponds to the policy update algorithm.
- The algorithm is on-policy (why is PG on-policy?).
- Since the policy is stochastic, no need to use techniques like $\varepsilon$-greedy.
- This particular actor-critic algorithm is sometimes referred to as Q Actor-Critic (QAC).
- Though simple, this algorithm reveals the core idea of actor-critic methods.
Notes on the remarks
- Division of labor between actor and critic
- Actor: learns the policy $\pi(a|s,\theta)$ (the policy-gradient update);
- Critic: learns the value function $q(s,a,w)$ (SARSA + function approximation).
- On-policy property
- The sampled data must be generated by the current policy $\pi$.
- This is because $\nabla_\theta \ln \pi(a|s,\theta)$ depends directly on the current policy.
- There is no need for $\varepsilon$-greedy exploration as in Q-learning.
- Why is it called QAC?
- Because the critic uses the action-value function $Q(s,a)$, hence Q actor-critic.
- Significance
- REINFORCE uses MC → high variance;
- actor-critic uses TD → lower variance and more stability.
- QAC is the simplest actor-critic algorithm, but it already reveals the core idea: the actor adjusts the policy while the critic provides the signal (a minimal sketch follows below).
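A minimal PyTorch sketch of one QAC update step, assuming a discrete action space, an actor network that maps states to action logits, and a Q-network critic; the shapes, optimizers, and step sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class QCritic(nn.Module):
    """q(s, ., w): maps a state to one value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, s):
        return self.net(s)

def qac_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, a_next, gamma=0.99):
    a, a_next = torch.as_tensor(a), torch.as_tensor(a_next)

    # Critic (value update): SARSA-style TD update of q(s_t, a_t, w)
    q_sa = critic(s)[a]
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)[a_next]
    critic_loss = (td_target - q_sa) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy update): gradient ascent on ln π(a_t|s_t, θ) · q(s_t, a_t, w_{t+1})
    with torch.no_grad():
        q_coeff = critic(s)[a]                       # treated as a constant coefficient
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q_coeff)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

Here `actor` can be any module that maps a state to action logits, e.g. `nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))`, with `torch.optim.Adam` optimizers for both networks.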
Advantage actor-critic (A2C)
Baseline invariance
Property: the policy gradient is invariant to an additional baseline
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, q_\pi(S,A) \Big] = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S, \theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$
- Here, the additional baseline $b(S)$ is a scalar function of $S$.
- Next, we answer two questions:
- Why is it valid?
- Why is it useful?
The core idea of baseline invariance
In policy gradient, the update is based on:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \eta,\, A \sim \pi} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A) \Big]$$
This means the update direction of the policy parameters is determined by the value of the action in the current state, $q_\pi(S,A)$.
However, we can introduce a baseline $b(S)$ into the formula:
$$\nabla_\theta J(\theta) = \mathbb{E}_{S,A} \Big[ \nabla_\theta \ln \pi(A|S,\theta_t) \big(q_\pi(S,A) - b(S)\big) \Big]$$
Key conclusion: whatever $b(S)$ we choose, the expression is unchanged, i.e., the baseline does not change the expected gradient.
First, why is it valid?
That is because
$$\mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] = 0$$
The details:
$$\begin{aligned} \mathbb{E}_{S \sim \eta,\, A \sim \pi}\Big[ \nabla_\theta \ln \pi(A|S, \theta_t)\, b(S) \Big] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t)\, \nabla_\theta \ln \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t)\, b(s) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \\ &= \sum_{s \in \mathcal{S}} \eta(s) b(s)\, \nabla_\theta 1 \\ &= 0 \end{aligned}$$
Why it is valid
We have shown that
$$\mathbb{E}_{S,A} \Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big] = 0$$
Intuition:
- the baseline is just a reference value that does not depend on the action;
- in expectation, its contribution to the gradient cancels out completely;
- so the baseline introduces no bias (the estimator remains unbiased); a numerical check is sketched below.
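A quick numerical sanity check of this fact (a sketch with an arbitrary softmax policy over three actions in a single state; $\theta$ and $b(s)$ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                   # preferences for one state
pi = np.exp(theta) / np.exp(theta).sum()     # softmax policy π(a|s)
b = 5.0                                      # any baseline value b(s)

grad_log_pi = np.eye(3) - pi                 # row a is ∇_θ ln π(a|s) for a softmax policy

expectation = (pi[:, None] * grad_log_pi * b).sum(axis=0)
print(expectation)                           # ≈ [0, 0, 0]: the baseline term vanishes in expectation
```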
Second, why is the baseline useful?
The gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}[X]$$
where
$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t) \big[q_\pi(S,A) - b(S)\big]$$
We have
- $\mathbb{E}[X]$ is invariant to $b(S)$.
- $\mathrm{var}(X)$ is NOT invariant to $b(S)$.
Why? Because
$$\mathrm{tr}[\mathrm{var}(X)] = \mathbb{E}[X^T X] - \bar{x}^T \bar{x}$$
and
$$\mathbb{E}[X^T X] = \mathbb{E}\Big[ (\nabla_\theta \ln \pi)^T (\nabla_\theta \ln \pi)\, (q_\pi(S,A) - b(S))^2 \Big] = \mathbb{E}\Big[ \|\nabla_\theta \ln \pi\|^2 (q_\pi(S,A) - b(S))^2 \Big]$$
Why it is useful
Although the baseline does not change the expected gradient, it does affect the variance of the gradient estimate:
the actual update uses a sample-based approximation,
$$\nabla_\theta J \approx \nabla_\theta \ln \pi(a|s,\theta)\, \big(q_\pi(s,a) - b(s)\big)$$
Without a baseline, the variance can be large (because $q_\pi(s,a)$ fluctuates a lot);
with a suitable baseline, the variance is reduced significantly and the updates become more stable.
This is the idea of variance reduction, illustrated numerically below.
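A Monte Carlo sketch of this effect on a toy single-state problem (the action values, policy, and sample size are arbitrary placeholders): the sample mean of the gradient estimate is essentially the same with or without the baseline, while the trace of its variance shrinks when $b(s) = v_\pi(s)$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
pi = np.exp(theta) / np.exp(theta).sum()     # π(a|s) for a single state
q = np.array([10.0, 12.0, 8.0])              # q_π(s, a), placeholder values
v = (pi * q).sum()                           # v_π(s), the suboptimal baseline

def sample_X(baseline, n=100_000):
    actions = rng.choice(3, size=n, p=pi)
    grad_log_pi = np.eye(3)[actions] - pi    # ∇_θ ln π(a|s) per sample
    return grad_log_pi * (q[actions] - baseline)[:, None]

for name, b in [("b = 0", 0.0), ("b = v_pi(s)", v)]:
    X = sample_X(b)
    print(name, "mean:", X.mean(axis=0).round(3),
          " tr[var]:", X.var(axis=0).sum().round(3))
```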
Our goal
- Select an optimal baseline $b$ to minimize $\mathrm{var}(X)$.
- Benefit: when we use a random sample to approximate $\mathbb{E}[X]$, the estimation variance would also be small.
- In the algorithms of REINFORCE and QAC,
- There is no baseline.
- Or, we can say $b = 0$, which is not guaranteed to be a good baseline.
The optimal baseline
The optimal baseline that can minimize $\mathrm{var}(X)$ is, for any $s \in \mathcal{S}$,
$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2 \big]}$$
Although this baseline is optimal, it is complex.
We can remove the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ and select the suboptimal baseline:
$$b(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s)$$
- which is the state value of $s$.
The optimal and suboptimal baselines
Optimal baseline:
$$b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2\, q_\pi(s,A) \big]}{\mathbb{E}_{A \sim \pi} \big[ \|\nabla_\theta \ln \pi(A|s,\theta)\|^2 \big]}$$
Theoretically optimal, but complicated to compute.
Suboptimal baseline:
$$b(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s)$$
That is, the state-value function.
This is the key to A2C: use $A(s,a) = q_\pi(s,a) - v_\pi(s)$ as the advantage function.
Connection to A2C (advantage actor-critic)
Actor part:
Update the policy using the advantage function $A(s,a) = q_\pi(s,a) - v_\pi(s)$:
$$\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a|s,\theta)\, A(s,a)$$
Critic part:
Learn the value function $v_\pi(s)$ to serve as the baseline $b(s)$.
Intuition:
- The critic estimates the baseline (the state value $v_\pi(s)$);
- the actor updates with $q_\pi - v_\pi$, so actions better than average are reinforced and actions worse than average are weakened (see the numeric example below);
- the benefit is lower gradient variance and more stable updates.
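A tiny numeric illustration with made-up numbers:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])   # π(a|s)
q = np.array([3.0, 1.0, 2.0])    # q_π(s, a)
v = (pi * q).sum()               # v_π(s) = 1.7
advantage = q - v                # [ 1.3, -0.7,  0.3]
# A(s,a) > 0: better than average -> its probability is pushed up;
# A(s,a) < 0: worse than average -> its probability is pushed down.
```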
The algorithm of advantage actor-critic
When $b(s) = v_\pi(s)$,
the gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\big[q_\pi(S,A) - v_\pi(S)\big]\Big] \doteq \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S, \theta_t)\, \delta_\pi(S,A)\Big]$$
where
$$\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$$
is called the advantage function (why called advantage?).
The stochastic version of this algorithm is
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\big[q_t(s_t, a_t) - v_t(s_t)\big] = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t,a_t)$$
Moreover, the algorithm can be reexpressed as
$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t,a_t) \\ &= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)}\, \delta_t(s_t,a_t) \\ &= \theta_t + \alpha \left(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t, \theta_t)}\right) \nabla_\theta \pi(a_t|s_t, \theta_t) \end{aligned}$$
- The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$, which is more reasonable.
- It can still well balance exploration and exploitation.
Furthermore, the advantage function is approximated by the TD error:
$$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\to\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$
This approximation is reasonable because
$$\mathbb{E}\big[q_\pi(S,A) - v_\pi(S) \mid S=s_t, A=a_t\big] = \mathbb{E}\big[R + \gamma v_\pi(S') - v_\pi(S) \mid S=s_t, A=a_t\big]$$
Benefit: only one network is needed to approximate $v_\pi(s)$, rather than two networks for $q_\pi(s,a)$ and $v_\pi(s)$.
Advantage actor-critic (A2C) or TD actor-critic
Aim: Search for an optimal policy by maximizing $J(\theta)$.
At time step $t$ in each episode, do
Generate $a_t$ following $\pi(a|s_t, \theta_t)$ and then observe $r_{t+1}, s_{t+1}$.
TD error (advantage function):
$$\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$$
Critic (value update):
$$w_{t+1} = w_t + \alpha_w\, \delta_t \nabla_w v(s_t, w_t)$$
Actor (policy update):
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t \nabla_\theta \ln \pi(a_t|s_t, \theta_t)$$
It is on-policy. Since the policy $\pi(\theta_t)$ is stochastic, no need to use techniques like $\varepsilon$-greedy.
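A minimal PyTorch sketch of one A2C update step, assuming a discrete action space, an actor network mapping states to logits, and a critic mapping states to a scalar $v(s, w)$; the shapes, optimizers, and step sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

def a2c_step(actor, critic, actor_opt, critic_opt, s, a, r, s_next, gamma=0.99):
    # TD error as the advantage estimate: δ_t = r_{t+1} + γ v(s_{t+1}, w) - v(s_t, w)
    v_s = critic(s).squeeze()
    with torch.no_grad():
        td_target = r + gamma * critic(s_next).squeeze()
    delta = (td_target - v_s).detach()           # δ_t, used as a constant coefficient

    # Critic (value update): move v(s_t, w) toward the TD target
    critic_loss = (td_target - v_s) ** 2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy update): gradient ascent on δ_t ln π(a_t|s_t, θ)
    log_prob = Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -(delta * log_prob)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return float(delta)
```

Only a single state-value network is needed here, matching the benefit noted above; `critic` maps a state to one scalar, e.g. `nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))`.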
Baseline → advantage function → A2C implementation
1. From baseline to advantage function
In policy gradient we have the basic update
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$$
However, using $q_\pi(s,a)$ directly tends to produce high variance, so we introduce a baseline $b(s)$ to reduce it:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\big[q_\pi(s_t,a_t) - b(s_t)\big]$$
A common choice is the state-value function $b(s) = v_\pi(s)$. The update then becomes
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, \delta_\pi(s_t,a_t)$$
where
$$\delta_\pi(s,a) = q_\pi(s,a) - v_\pi(s)$$
is called the advantage function.
Intuition:
- $q_\pi(s,a)$ is the long-term value of taking action $a$ in state $s$.
- $v_\pi(s)$ is the average value of state $s$ (weighted over all actions).
- Therefore $\delta_\pi(s,a)$ measures how good this action is relative to the average.
- $\delta > 0$ → the action is better than average, so its probability should be increased.
- $\delta < 0$ → the action is worse than average, so its probability should be decreased.
2. Further rewriting of the update
Using the likelihood-ratio form, the update can be written as
$$\theta_{t+1} = \theta_t + \alpha \left(\frac{\delta_t(s_t,a_t)}{\pi(a_t|s_t,\theta_t)}\right) \nabla_\theta \pi(a_t|s_t,\theta_t)$$
This shows that the step size is tied to the relative magnitude of the advantage, which gives a more reasonable balance between exploration and exploitation.
3. Approximating the advantage: the TD error
Computing $q_\pi(s,a)$ directly is too expensive, so the temporal-difference (TD) error is used as an approximation:
$$\delta_t = q_t(s_t,a_t) - v_t(s_t) \;\;\approx\;\; r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)$$
- Benefits:
- Only one value network $v_\pi(s)$ needs to be learned, instead of both $q_\pi(s,a)$ and $v_\pi(s)$, which reduces the computational cost.
- $\delta_t$ is both the TD error and an approximation of the advantage function.
4. The A2C algorithm flow
A2C combines the actor (policy update) and the critic (value update):
Critic (learns $v(s)$ and provides the learning signal):
$$w_{t+1} = w_t + \alpha_w\, \delta_t \nabla_w v(s_t, w_t)$$
- The critic uses the TD error $\delta_t$ to update $v(s)$.
Actor (updates the policy):
$$\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$$
- The actor adjusts the policy according to the signal $\delta_t$ provided by the critic.
Intuition:
- The critic judges how good the action actually was and computes $\delta_t$.
- The actor, based on the critic's feedback, increases the probability of good actions and decreases the probability of bad ones.
Why does A2C help?
- Lower variance: the baseline $v(s)$ effectively reduces the randomness of the updates.
- More informative signal: the advantage tells us how an action compares to the average, rather than its absolute value.
- More efficient: the TD error approximates $q(s,a) - v(s)$, so only one critic network is needed.
- Still on-policy: sampling and updating are done under the current policy, with no extra exploration mechanism (e.g., $\epsilon$-greedy) required.
In summary, the key logic of A2C is:
policy gradient + baseline → advantage function → approximate the advantage with the TD error → actor and critic update jointly (a full training-loop sketch follows below).
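Putting the pieces together, a sketch of the full on-policy training loop, reusing the `a2c_step` function from the sketch above and assuming a Gymnasium-style `CartPole-v1` environment (the environment, network sizes, and learning rates are assumptions for illustration):

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")                 # 4-dimensional state, 2 actions
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # state -> logits
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state -> v(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)
        a = Categorical(logits=actor(s)).sample().item()   # on-policy sampling, no ε-greedy
        obs, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        s_next = torch.as_tensor(obs, dtype=torch.float32)
        # do not bootstrap past a terminal state
        a2c_step(actor, critic, actor_opt, critic_opt,
                 s, a, float(r), s_next, gamma=0.0 if terminated else 0.99)
```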
Summary
Actor-critic methods use the critic's value estimates to guide the actor's policy updates. In deterministic policy gradient methods (DPG/DDPG), by contrast, the actor outputs an action directly and is updated with $\nabla_\theta \mu(s)\nabla_a q(s,a)$; this avoids sampling from a probability distribution and is efficient for continuous action spaces, but exploration has to rely on extra added noise.