
I have been studying policy gradients recently, but different sources give different expositions, which confused me. From the book "Reinforcement Learning: An Introduction" (Sutton & Barto, Chapter 13), we get the following policy gradient: $$ \nabla J(\theta) = \mathbb E_\pi\left[G_t\nabla\log\pi(A_t | S_t, \theta)\right]. $$ As we can see, this expression does not involve a distribution over trajectories. However, a more intuitive and widely used introduction to the policy gradient starts by defining the distribution of trajectories $p(\tau)$. For example, in OpenAI Spinning Up, the policy gradient has a form similar to the following: $$ \nabla J(\theta) = \mathbb E_{\tau \sim \pi}\left[\sum_{t=0}^{T}G_t\nabla_\theta\log\pi_\theta(a_t | s_t)\right]. $$ The confusion comes from the fact that the first form has no summation over timesteps and does not sample trajectories, while the second samples trajectories and sums over timesteps.
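For concreteness, here is a minimal sketch (my own toy example, not taken from either source) of how the second, trajectory-based form is usually estimated in practice: sample trajectories, compute the reward-to-go $G_t$ at every timestep, and average $\sum_t G_t \nabla_\theta\log\pi_\theta(a_t|s_t)$ over trajectories. The tabular softmax policy and the `env_step` function are my own assumptions for illustration.

```python
# Sketch of the trajectory-based estimator: average over trajectories of
# sum_t G_t * grad log pi(a_t | s_t). Toy tabular setup, not from either source.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.99
theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(state):
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(state, action):
    # For a tabular softmax: d log pi(a|s) / d theta[s, a'] = 1{a' = a} - pi(a'|s)
    g = np.zeros_like(theta)
    g[state] = -policy(state)
    g[state, action] += 1.0
    return g

def sample_trajectory(env_step, s0, horizon=50):
    # env_step(s, a) -> (next_state, reward) is assumed to be supplied by the user
    states, actions, rewards = [], [], []
    s = s0
    for _ in range(horizon):
        a = np.random.choice(n_actions, p=policy(s))
        s_next, r = env_step(s, a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def policy_gradient_estimate(trajectories):
    # Spinning-Up-style estimator: E_tau [ sum_t G_t * grad log pi(a_t|s_t) ]
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        # reward-to-go G_t = sum_{k >= t} gamma^(k-t) r_k, computed backwards
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s, a, G_t in zip(states, actions, returns):
            grad += G_t * grad_log_pi(s, a)
    return grad / len(trajectories)
```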

I did find some relevant questions about this confusion, but none of them seemed to have a good answer. Also, I could not identify any source that explained the difference/connection between the two forms.

My question is: why are there two different ways to describe the policy gradient, and are the two forms mathematically equivalent?

Update

I found a great RL theory book (draft), written by expert professors in this field, that presents the two different formulations: https://rltheorybook.github.io. Also, one of Nan Jiang's lecture notes shows the connection between the two forms: https://nanjiang.cs.illinois.edu/files/cs598/note6.pdf.

1 Answer


After reviewing Nan Jiang's lecture notes linked above, here are my conclusions:

Let $d_t^\pi$ denote the distribution over $(s_t, a_t)$ induced by policy $\pi$ from the initial state distribution $d_0$. The first form is then formally expressed as \begin{equation} \nabla J(\pi)=\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d^\pi}\left[\nabla\log\pi(a|s)\,Q^\pi(s,a)\right], \end{equation} where $d^\pi$ is the normalized (discounted) state-action distribution \begin{equation} d^\pi = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} d_t^{\pi}. \end{equation} We can see that the first form sums over timesteps inside the state-action distribution $d^\pi$, while the second form sums over timesteps inside the expectation. They capture the same long-term information in different ways.
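To make the connection explicit (this is my own restatement of the standard argument in the note, in the discounted setting where each timestep carries a $\gamma^t$ weight), start from the trajectory form and swap the sum over timesteps with the expectation: \begin{align} \nabla J(\pi) &= \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\,\nabla\log\pi(a_t|s_t)\,Q^\pi(s_t,a_t)\right] \\ &= \sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{(s,a)\sim d_t^\pi}\left[\nabla\log\pi(a|s)\,Q^\pi(s,a)\right] \\ &= \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^\pi}\left[\nabla\log\pi(a|s)\,Q^\pi(s,a)\right], \end{align} where the last step uses the definition of $d^\pi$ above. Replacing $Q^\pi(s_t,a_t)$ with the sampled return-to-go $G_t$ (an unbiased estimate of it given $(s_t,a_t)$) recovers the trajectory-based form quoted in the question.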

Also, since $\frac{1}{1-\gamma}$ is a constant, we can drop it in practice and absorb it into the learning rate.