Implementation difference in REINFORCE algorithm, where to sum from

Question

I have a question regarding the implementation of the REINFORCE algorithm.

In the Berkeley course (see slide 9), the gradient is defined as

Note that the return sums from step 1. However, in Sutton's book the return sums from t+1.

In fact, two popular implementations, PyTorch's official example (see line 71) and Tianshou's example (see line 226), compute, if I understand correctly,

sum_t (∇log π(a_t | s_t) * R_t)

which seems to correspond to Sutton's definition rather than Berkeley's, which would become

(sum_t ∇log π(a_t | s_t)) * (sum_t r_t)
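To make the two weightings concrete, here is a minimal sketch in plain Python. The helper names `rewards_to_go` and `total_return_weights` are hypothetical and are not taken from the PyTorch or Tianshou code; the sketch only illustrates which scalar multiplies ∇log π(a_t | s_t) at each step under each reading, for a single episode.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Per-step weights R_t = sum_{k >= t} gamma^(k - t) * r_k (reward-to-go)."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each step reuses the tail sum.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def total_return_weights(rewards):
    """Per-step weights when every step is multiplied by the full episode return."""
    total = float(sum(rewards))
    return [total] * len(rewards)


rewards = [1.0, 0.0, 2.0]
print(rewards_to_go(rewards))         # [3.0, 2.0, 2.0]
print(total_return_weights(rewards))  # [3.0, 3.0, 3.0]
```

With `gamma=1.0` the two lists differ only in that reward-to-go drops rewards collected before step t.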

I wonder which definition is the de facto standard in the community. Thank you.

score 2 · Accepted Answer · answered Jan 27 '24 at 01:09

The policy gradient objective in your Berkeley reference is the undiscounted value function of the start state in the finite-horizon episodic case. Sutton's reference, at least in the boxed pseudocode, uses the discounted value function of the start state in the same setting, as stated in the book:

From here on in our discussion we will assume no discounting ($\gamma = 1$) for the episodic case, although for completeness we do include the possibility of discounting in the boxed algorithms.

Since the undiscounted total return $G_t$ is defined as $\sum_t^T r_t(s_t,a_t)$, there is no essential difference in notation here. Berkeley's equation is an approximation expressed over samples $i$, so using $G_t$ there would require extra indices such as $G_{t,i}$ and add unnecessary confusion.

score 1 · Answer 2 · answered Jan 26 '24 at 02:28


I think I understand now. The implementations use the reward-to-go version, which can be shown to be equivalent to the original loss (see here).
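For reference, the usual causality argument behind that equivalence can be sketched as follows. This is the standard textbook reasoning as I understand it, not a quote from the linked material; the notation $r_{t'}$ for the reward at step $t'$ and $\pi_\theta$ for the policy is assumed, with no discounting.

```latex
% A reward collected before step t cannot depend on the action a_t,
% so for t' < t (conditioning on the history h_t up to and including s_t):
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\big]
  = \mathbb{E}\Big[ r_{t'} \,
      \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
        \big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\big|\, h_t \big] \Big]
  = 0,
% because the inner expectation is the gradient of a constant:
\sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
  = \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t)
  = \nabla_\theta 1 = 0.
% Dropping those zero-mean terms gives the reward-to-go form:
\mathbb{E}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t'=1}^{T} r_{t'}\Big]
  = \mathbb{E}\Big[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t'=t}^{T} r_{t'}\Big].
```

The two estimators therefore agree in expectation, while individual sample estimates generally differ; the reward-to-go form is usually preferred because it tends to have lower variance.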


Chris XU
