
As discussed in this question, the policy gradient algorithms given in Reinforcement Learning: An Introduction use the gradient \begin{align*} \gamma^t \hat A_t \nabla_{\theta} \log \pi(a_t \, | \, s_t, \theta) \end{align*} where $\hat A_t$ is the advantage estimate for step $t$. For example, $\hat A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ in the one-step actor-critic algorithm given in section 13.5.
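For concreteness, here is a minimal sketch of that one-step actor-critic loop (my own paraphrase, not the book's pseudocode verbatim); `env`, `sample_action`, `grad_log_pi`, `v` and `grad_v` are hypothetical placeholders, and the point is only to show where the $\gamma^t$ factor, tracked incrementally as `I`, multiplies the actor update:

```python
# Hypothetical interfaces: `env` follows a gym-like reset/step API, and
# `sample_action`, `grad_log_pi`, `v`, `grad_v` are user-supplied callables.
def one_step_actor_critic_episode(env, sample_action, grad_log_pi, v, grad_v,
                                  theta, w, gamma=0.99,
                                  alpha_theta=1e-3, alpha_w=1e-2):
    s = env.reset()
    I = 1.0                       # I = gamma^t, the factor in question
    done = False
    while not done:
        a = sample_action(theta, s)
        s_next, r, done = env.step(a)
        v_next = 0.0 if done else v(w, s_next)
        delta = r + gamma * v_next - v(w, s)       # one-step advantage estimate
        w = w + alpha_w * delta * grad_v(w, s)     # critic update (no gamma^t)
        # actor update: note the extra I = gamma^t factor in front
        theta = theta + alpha_theta * I * delta * grad_log_pi(theta, s, a)
        I *= gamma
        s = s_next
    return theta, w
```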

In the answers to the linked question, it is claimed that the extra discounting is "correct", which implies that it should be included.

If I look at a seminal paper from the literature, such as Proximal Policy Optimization Algorithms by OpenAI, they do not include the extra discounting factor; i.e., they use a gradient defined as \begin{align*} \hat A_t \dfrac{\nabla_{\theta}\pi(a_t \, | \, s_t, \theta)}{\pi(a_t \, | \,s_t, \theta_{\rm old})} \end{align*} which does not include the $\gamma^t$ factor (of course, it's dealing with the off-policy, importance-sampled case, but I don't see how that would make a difference in terms of the discounting). OpenAI's implementation of PPO also does not include the extra discounting factor.
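For comparison, the policy part of the PPO objective is typically computed along the following lines (a schematic NumPy sketch of the standard clipped surrogate, not the baselines code itself); note that the advantage estimates are multiplied only by the probability ratio, with no $\gamma^t$ factor anywhere:

```python
import numpy as np

def ppo_clip_surrogate(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Schematic PPO clipped surrogate objective (to be maximised).

    log_probs_new / log_probs_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t).
    advantages: advantage estimates, one per timestep.
    Note: no gamma^t factor multiplies the advantages here.
    """
    ratio = np.exp(log_probs_new - log_probs_old)   # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```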

So, how am I supposed to interpret this discrepancy? I agree that the extra discounting factor should be present, from a theoretical standpoint. Then, why is it not in the OpenAI code or paper?

Taw

2 Answers

If you want to maximize the expected (undiscounted) reward \begin{align*} \mathbb{E}\bigg[\sum_{t=1}^nr_t \bigg] \end{align*} and are using a score-function gradient estimator (as opposed to a SAC/DDPG-style update), you have the unbiased gradient estimator \begin{align*} \sum_{i=1}^n \sum_{k=i}^n r_k\nabla_{\theta}\log\pi(a_i \, | \, s_i) \tag{1} \end{align*} You can then add discounting as a variance-reduction technique: the gradient estimator \begin{align*} \sum_{i=1}^n \sum_{k=i}^n \gamma^{k-i}r_k\nabla_{\theta}\log\pi(a_i \, | \, s_i) \tag{2} \end{align*} will have lower variance than Eq. (1) (see this answer).

If you want to maximize the expected discounted reward \begin{align*} \mathbb{E}\bigg[\sum_{t=1}^n \gamma^{t-1}r_t \bigg] \end{align*} you get the unbiased gradient estimator \begin{align*} \sum_{i=1}^n \sum_{k=i}^n \gamma^{k-1}r_k\nabla_{\theta}\log\pi(a_i \, | \, s_i) \tag{3} \end{align*} So Sutton and Barto are essentially presenting the formulation of Eq. (3). The difference between (2) and (3) is the per-step factor of $\gamma^{i-1}$, which is what I was confused about in the question.

Thus, in summary: (2) is a biased estimator of the gradient of the expected (undiscounted) reward, whereas (3) is an unbiased estimator of the gradient of the expected discounted reward.
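To make the comparison concrete, here is a small NumPy sketch (my own illustration, not taken from either source) that computes the coefficient multiplying $\nabla_{\theta}\log\pi(a_i \, | \, s_i)$ at each step under estimators (1), (2) and (3), and checks that the coefficients of (3) are exactly the coefficients of (2) scaled by $\gamma^{i-1}$ (i.e. $\gamma^{i}$ with 0-based indexing):

```python
import numpy as np

def per_step_coefficients(rewards, gamma):
    """Coefficient of grad log pi(a_i|s_i) under estimators (1), (2), (3)."""
    r = np.asarray(rewards, dtype=float)
    n = len(r)
    c1 = np.array([r[i:].sum() for i in range(n)])                    # Eq. (1)
    c2 = np.array([sum(gamma**(k - i) * r[k] for k in range(i, n))    # Eq. (2)
                   for i in range(n)])
    c3 = np.array([sum(gamma**k * r[k] for k in range(i, n))          # Eq. (3), 0-indexed
                   for i in range(n)])
    return c1, c2, c3

rewards = [1.0, 0.5, 2.0, -1.0]
gamma = 0.9
c1, c2, c3 = per_step_coefficients(rewards, gamma)
i = np.arange(len(rewards))
# (3) differs from (2) only by the extra gamma^(i-1) factor (gamma^i when 0-indexed)
assert np.allclose(c3, gamma**i * c2)
```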

Taw

I believe you will find the answer in the paper High-Dimensional Continuous Control Using Generalized Advantage Estimation, which is the basis for the advantage function used in the PPO paper that you referenced.

From the paper, the estimate of the advantage function is defined as: \begin{align*} \hat{A}_{t}^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l}^{V} \end{align*} where $\delta_{t}^{V}$, the TD residual of $V$, is defined as: \begin{align*} \delta_{t}^{V} = r_{t}+\gamma V(s_{t+1})-V(s_{t}) \end{align*} where $V$ is an approximation of the value function.
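In practice this sum is usually computed over a finite rollout with the backward recursion $\hat{A}_{t} = \delta_{t}^{V} + \gamma\lambda\,\hat{A}_{t+1}$; here is a minimal NumPy sketch of that computation (my own illustration, not the baselines code, with episode-termination masking omitted for brevity):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) via the backward recursion A_t = delta_t + gamma*lam*A_{t+1}.

    rewards: r_0 .. r_{T-1};  values: V(s_0) .. V(s_{T-1});  last_value: V(s_T).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    T = len(rewards)
    advantages = np.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages
```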

If you look closely at these two equations, you will see that the discount $\gamma$ enters twice: once inside each TD residual $\delta_{t}^{V}$ (via the $\gamma V(s_{t+1})$ term) and once in the $(\gamma\lambda)^{l}$ weighting of the residuals.
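To spell that out (my own expansion of the two equations above): \begin{align*} \hat{A}_{t}^{GAE(\gamma,\lambda)} &= \delta_{t}^{V} + \gamma\lambda\,\delta_{t+1}^{V} + (\gamma\lambda)^{2}\delta_{t+2}^{V} + \cdots \\ &= \big(r_{t} + \gamma V(s_{t+1}) - V(s_{t})\big) + \gamma\lambda\big(r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\big) + \cdots \end{align*}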

I have not gone through the whole OpenAI implementation of PPO, but if I am not mistaken, the implementation of the above equations can be found here, in ppo2/runner.py.

Lars