The problem I am facing right now is tying the theory of advantage actor-critic from Sutton & Barto to the A3C implementation I read here. From what I understand:
The critic network (value function) loss function is given by: $L_V = \frac{1}{2} \left(\left[ r + \gamma V(s') \right] - V(s)\right)^2 $ (1)
The actor network maximizes the advantage function as a proxy for the policy's performance: $J(\theta) = A(s, a)$ (2) $\nabla_{\theta} J(\theta) = \mathbb{E} \left[ \nabla_{\theta} \log \pi_{\theta}(a | s) A(s, a) \right]$ (3)
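To make sure I am reading (1) and (3) correctly, here is a minimal sketch of how I would compute these quantities (assumed PyTorch; `value_net`, `gamma`, and the dummy tensors are placeholders of my own, not taken from the linked implementation):

```python
import torch

# Minimal sketch of eq. (1) and the one-step advantage used in (3).
# `value_net`, `reward`, `done`, and the states are placeholders/assumptions.
value_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
gamma = 0.99

s = torch.randn(1, 4)          # current state (dummy)
s_next = torch.randn(1, 4)     # next state (dummy)
reward = torch.tensor([[1.0]])
done = torch.tensor([[0.0]])   # 1.0 when the episode terminates

v_s = value_net(s)                                               # V(s)
with torch.no_grad():                                            # bootstrap target is not differentiated
    td_target = reward + gamma * (1.0 - done) * value_net(s_next)  # r + gamma * V(s')

advantage = (td_target - v_s).detach()                           # A(s, a) estimated as the TD error
critic_loss = 0.5 * (td_target - v_s).pow(2).mean()              # eq. (1)
```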
My questions:
For the actor, we maximize $J(\theta)$, but I have seen people use $\mathcal{L}_{\text{actor}} = - \mathbb{E}_{t} \left[ \log \pi(a_t | s_t; \theta) \cdot A(s_t, a_t) \right]$. I assume we take the expression we derived for $\nabla_{\theta} J(\theta)$ in (3), drop the $\nabla_{\theta}$ to recover an objective, and instead of maximizing that objective we minimize its negative. Am I on the right track?
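To illustrate what I mean, this is roughly the actor loss I have in mind (assumed PyTorch; the `Categorical` policy and the advantage value are placeholders of mine):

```python
import torch

# What I think the actor loss looks like: minimize the negative of the term in (3),
# so gradient descent on `actor_loss` is gradient ascent on J(theta).
logits = torch.randn(1, 3, requires_grad=True)          # placeholder actor output
dist = torch.distributions.Categorical(logits=logits)   # pi(. | s_t; theta)
action = dist.sample()
advantage = torch.tensor(0.7)                           # A(s_t, a_t) from the critic, already detached

log_prob = dist.log_prob(action)                        # log pi(a_t | s_t; theta)
actor_loss = -(log_prob * advantage).mean()             # only the actor's parameters get gradients
```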
Because the actor and critic use two different loss functions, I thought we would have to set up separate optimizers for each of them. But from what I have seen, people combine the losses into a single loss function. Why is that?
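For reference, this is the pattern I keep seeing (a sketch under my own assumptions: a shared trunk, a 0.5 value coefficient, a 0.01 entropy bonus, and dummy data; none of these numbers come from the book or the linked code):

```python
import torch
import torch.nn as nn

# Sketch of the "single loss, single optimizer" pattern I keep seeing.
class ActorCritic(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)   # logits for pi(a|s)
        self.value_head = nn.Linear(64, 1)            # V(s)

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.value_head(h)

net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # one optimizer over all parameters

s = torch.randn(1, 8)                                  # dummy state
logits, value = net(s)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

td_target = torch.tensor([[0.9]])                      # placeholder for r + gamma * V(s')
advantage = (td_target - value).detach()               # no critic gradient through the actor term
actor_loss = -(dist.log_prob(action) * advantage.squeeze()).mean()
critic_loss = 0.5 * (td_target - value).pow(2).mean()  # eq. (1)
entropy = dist.entropy().mean()                        # exploration bonus used in A3C

total_loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

optimizer.zero_grad()
total_loss.backward()   # one backward pass; each term only reaches the parameters it depends on
optimizer.step()
```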
For CarRacing-v3, the action space has shape (3,) and each component is continuous. Should my actor output 6 values (a mean and a variance for each of the 3 actions)? Aren't these actions correlated? If so, don't I need a covariance matrix and to sample from a multivariate Gaussian?
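Here is the kind of output head I have in mind (a sketch with an independent, diagonal-covariance Gaussian, i.e. no cross-correlation terms; the flattened observation, layer sizes, and log-std clamping are my assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the actor head I have in mind: 3 means + 3 log-stds, treated as an
# independent (diagonal-covariance) Gaussian over the 3 continuous actions.
class GaussianActor(nn.Module):
    def __init__(self, obs_dim=96 * 96 * 3, act_dim=3):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mean = nn.Linear(256, act_dim)        # 3 means
        self.log_std = nn.Linear(256, act_dim)     # 3 log standard deviations

    def forward(self, s):
        h = self.body(s)
        mean = self.mean(h)
        std = self.log_std(h).clamp(-20, 2).exp()  # keep std in a sane range
        # Independent() turns 3 univariate Normals into one 3-D distribution
        # with a diagonal covariance, i.e. no covariance terms between actions.
        return torch.distributions.Independent(torch.distributions.Normal(mean, std), 1)

actor = GaussianActor()
s = torch.randn(1, 96 * 96 * 3)                    # flattened dummy observation
dist = actor(s)
action = dist.sample()                             # shape (1, 3)
log_prob = dist.log_prob(action)                   # shape (1,), joint log-prob of the 3 dims
```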
Is the critic trained similarly to the Atari DQN, i.e., with a main critic and a target critic, where the target critic is kept fixed while the main critic is trained and the two are periodically synced?
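For concreteness, this is the DQN-style scheme I am asking about (whether A3C actually uses it is exactly my question; the network, the dummy data, and the sync period are my own placeholders):

```python
import copy
import torch
import torch.nn as nn

# The DQN-style scheme I am asking about: a frozen target critic provides the
# bootstrap value and is periodically synced to the main critic.
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = copy.deepcopy(critic)
for p in target_critic.parameters():
    p.requires_grad_(False)                  # target network is never trained directly

SYNC_EVERY = 1000                            # made-up sync period
for step in range(1, 3001):
    s_next = torch.randn(1, 4)               # dummy next state
    with torch.no_grad():
        bootstrap = target_critic(s_next)    # used inside r + gamma * V_target(s')
    # ... train the main critic on (r + gamma * bootstrap - V(s))^2 here ...
    if step % SYNC_EVERY == 0:
        target_critic.load_state_dict(critic.state_dict())   # hard sync
```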