In the original paper, the objective of PPO is as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$ and $\hat{A}_t$ is the advantage estimate at timestep $t$.

My question is: how does this objective behave in a sparse-reward setting, i.e., where a reward is only given after a whole sequence of actions has been taken? In that case we don't have $\hat{A}_{t}$ defined for every $t$.
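To make the setting I mean concrete, here is a minimal sketch (all names and values are hypothetical) of a sparse-reward episode, together with the GAE-style advantage computation described in the paper, assuming placeholder value estimates:

```python
import numpy as np

# Hypothetical sparse-reward episode of length T: reward is zero everywhere
# except at the final step, i.e. the agent is only rewarded after a whole
# sequence of actions.
T = 5
gamma, lam = 0.99, 0.95
rewards = np.zeros(T)
rewards[-1] = 1.0              # single reward at the end of the episode
values = np.zeros(T + 1)       # placeholder V(s_t) estimates; V(s_T) = 0 at termination

# GAE advantage estimates as described in the PPO paper:
#   delta_t   = r_t + gamma * V(s_{t+1}) - V(s_t)
#   A_hat_t   = delta_t + (gamma * lam) * A_hat_{t+1}
advantages = np.zeros(T)
gae = 0.0
for t in reversed(range(T)):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    gae = delta + gamma * lam * gae
    advantages[t] = gae

print(advantages)  # advantage estimates for each timestep of the episode
```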