
The problem I am facing right now is connecting the advantage actor-critic theory from Sutton & Barto to the implementation of A3C I read here. From what I understand:

  1. The critic network (value function) loss function is given by: $L_V = \frac{1}{2} \left(\left[ r + \gamma V(s') \right] - V(s)\right)^2$ (1); a short code sketch of how I read this follows after the list.

  2. The actor network maximizes the advantage function as a proxy for the policy's performance: $J(\theta) = A(s, a)$ (2), with gradient $\nabla_{\theta} J(\theta) = \mathbb{E} \left[ \nabla_{\theta} \log \pi_{\theta}(a | s) A(s, a) \right]$ (3)
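To check that I am reading (1) correctly, here is roughly how I picture the critic update in PyTorch (the tiny network and the dummy state/reward tensors are only placeholders):

```python
import torch
import torch.nn as nn

# Placeholder critic: maps a (here 4-dimensional) state vector to a scalar value V(s).
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

gamma = 0.99
s = torch.randn(1, 4)        # current state (dummy data)
s_next = torch.randn(1, 4)   # next state (dummy data)
r = torch.tensor([[1.0]])    # reward

# TD target r + gamma * V(s'), treated as a constant w.r.t. the critic parameters.
with torch.no_grad():
    td_target = r + gamma * critic(s_next)

# Equation (1): L_V = 0.5 * (td_target - V(s))^2
loss_v = 0.5 * (td_target - critic(s)).pow(2).mean()
loss_v.backward()
```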

My questions:

  1. For the actor, we maximize $J(\theta)$, but I have seen people use $\mathcal{L}_{\text{actor}} = - \mathbb{E}_{t} \left[ \log \pi(a_t | s_t; \theta) \cdot A(s_t, a_t) \right]$. I assume we take the term whose gradient appears in (3), and instead of maximizing it we minimize its negative. Am I on the right track?

  2. Because the actor and critic use two different loss functions, I thought we would have to set up separate optimizers for each of them. But from what I have seen, people combine the losses into a single loss function. Why is that?

  3. For CarRacing-v3, the action space has shape (1x3) and each element is continuous. Should my actor output 6 values (3 means and 3 variances, one pair per action dimension)? Are these values not correlated? If they are, do I not need a covariance matrix and to sample from a multivariate Gaussian?

  4. Is the critic trained similarly to the Atari DQN, with a target critic and a main critic, where the target critic is frozen while the main critic is trained and the two are periodically synced?

DeadAsDuck

1 Answer


You are on the right track overall. The actor loss $\mathcal{L}_{\text{actor}}$ is just the negative of the policy-gradient objective suggested by the policy gradient theorem, so that you can minimize a loss instead of maximizing the return. Note that your objective in equation (2) is missing the expectation with respect to the random state-action pair $(s,a)$. Also, in your referenced Medium article $\mathcal{L}_{\text{actor}}$ is an expectation with respect to that random state-action pair $(s,a)$, not with respect to the timestep $t$ as you wrote it; the article itself mentions that "$G_t=G(s_t,a_t)$ is the expected return for a given state, action pair at timestep $t$".
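Concretely, most implementations build the surrogate whose gradient matches (3) and minimize its negative. Here is a minimal PyTorch sketch with a discrete-action actor and dummy tensors (the shapes and names are illustrative, not taken from the article):

```python
import torch
import torch.nn as nn

# Illustrative discrete-action actor: state -> action logits (shapes are arbitrary).
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 3))

s = torch.randn(8, 4)               # batch of states (dummy data)
a = torch.randint(0, 3, (8,))       # actions that were actually taken
advantage = torch.randn(8)          # A(s, a) computed from the critic, treated as a constant

dist = torch.distributions.Categorical(logits=actor(s))
log_prob = dist.log_prob(a)

# Minimizing this negated surrogate performs gradient ascent on (3),
# because no gradient flows through the detached advantage.
actor_loss = -(log_prob * advantage.detach()).mean()
actor_loss.backward()
```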

Even though the actor and critic have different objectives, they are often combined into a single loss function and optimized with one optimizer and learning rate because, in many implementations such as A3C, part of the network is shared between the actor and the critic for computational efficiency and consistency during joint optimization.
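As a rough sketch of why a single combined loss with one optimizer is convenient when the trunk is shared (the architecture and the loss coefficient below are illustrative choices, not prescribed by A3C):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared trunk with separate policy and value heads, as in many A3C/A2C implementations."""
    def __init__(self, obs_dim=4, n_actions=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)  # actor
        self.value_head = nn.Linear(64, 1)           # critic

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.value_head(h)

net = SharedActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=7e-4)

s = torch.randn(8, 4)                    # dummy batch of states
a = torch.randint(0, 3, (8,))            # actions taken
returns = torch.randn(8, 1)              # bootstrapped value targets (dummy data)

logits, values = net(s)
advantage = (returns - values).detach()  # constant w.r.t. the parameters
dist = torch.distributions.Categorical(logits=logits)

actor_loss = -(dist.log_prob(a) * advantage.squeeze(-1)).mean()
critic_loss = 0.5 * (returns - values).pow(2).mean()

# One combined loss, one backward pass, one optimizer step for the shared parameters.
loss = actor_loss + 0.5 * critic_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```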

For an environment like CarRacing-v3, with a 3D continuous action space for steering/acceleration/brake, a diagonal multivariate Gaussian policy is commonly used for simplicity. Under such a policy the actor indeed outputs only $6$ values (a mean and a variance per dimension), and you may need a non-differentiable clipping step after the actor network and action sampling to keep each action within its valid range, such as $[-1,1]$ for steering and $[0,1]$ for acceleration/brake. No covariance matrix is needed; acceleration and brake may be anti-correlated and steering and brake may be correlated, but these correlations are rarely worth modeling.
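A minimal sketch of such a diagonal-Gaussian actor head for the 3D action, assuming a flattened observation as a stand-in for a proper CNN encoder (the clamping ranges follow CarRacing's steering/gas/brake bounds):

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Outputs 3 means and 3 log standard deviations for a diagonal (independent) Gaussian policy."""
    def __init__(self, obs_dim=96 * 96 * 3, act_dim=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.Tanh())
        self.mu = nn.Linear(256, act_dim)
        self.log_std = nn.Linear(256, act_dim)

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_std(h).clamp(-5, 2)  # keep the std in a sane range

actor = GaussianActor()
s = torch.randn(1, 96 * 96 * 3)          # flattened frame as a stand-in for a CNN feature extractor

mu, log_std = actor(s)
dist = torch.distributions.Normal(mu, log_std.exp())  # one independent Normal per action dimension
raw_action = dist.sample()
log_prob = dist.log_prob(raw_action).sum(-1)          # sum over the 3 independent dimensions

# Non-differentiable clipping into each action's valid range, applied to what is sent to the env.
action = torch.stack([
    raw_action[:, 0].clamp(-1.0, 1.0),   # steering
    raw_action[:, 1].clamp(0.0, 1.0),    # gas
    raw_action[:, 2].clamp(0.0, 1.0),    # brake
], dim=-1)
```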

Finally, unlike DQN, which typically uses a target network to stabilize learning because of the deadly triad, A3C is an on-policy method and does not usually rely on a separate target network for the critic. Instead, the critic's value target is computed from bias-variance-balanced n-step returns or batch/online GAE returns, and the critic is updated directly toward this target. The asynchronous worker agents and frequent parameter updates in A3C help decorrelate gradients, which further stabilizes learning without a separate target network.
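For example, the bootstrapped n-step targets used for the A3C critic can be computed directly from a short rollout, roughly like this (dummy data, plain n-step variant without GAE):

```python
import torch

def n_step_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """Bootstrapped n-step value targets for a short rollout, used as critic targets in A3C
    instead of a separate target network."""
    R = bootstrap_value                   # V(s_T) from the current critic (0 if terminal)
    targets = []
    for r, done in zip(reversed(rewards), reversed(dones)):
        R = r + gamma * R * (1.0 - done)  # reset the bootstrap at episode boundaries
        targets.append(R)
    return torch.tensor(list(reversed(targets)))

# Dummy 5-step rollout that ends in a terminal state.
rewards = [0.1, 0.0, -0.2, 0.3, 1.0]
dones = [0.0, 0.0, 0.0, 0.0, 1.0]
bootstrap_value = 0.0

targets = n_step_returns(rewards, dones, bootstrap_value)
# The critic then regresses V(s_t) toward these targets with the squared-error loss in (1).
```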

cinch