In the original paper, the objective of PPO is:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ is the estimated advantage at timestep $t$. My question is: how does this objective behave in a sparse-reward setting, i.e., where a reward is only given after a sequence of actions has been taken? In this case we don't have $\hat{A}_t$ defined for every $t$.
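For concreteness, here is a minimal sketch of the GAE computation the paper uses for $\hat{A}_t$, applied to a sparse-reward trajectory; the trajectory and the `values` array (standing in for a critic's estimates) are made up for illustration:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016).

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  value estimates V(s_0) .. V(s_T) (one extra entry for bootstrapping)
    Returns advantage estimates A_0 .. A_{T-1}.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Sparse-reward trajectory: reward only at the final step.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.zeros(6)  # hypothetical critic estimates, e.g. an untrained critic
print(gae_advantages(rewards, values))
```

With all intermediate rewards zero, each intermediate TD residual reduces to $\gamma V(s_{t+1}) - V(s_t)$, so the early advantage estimates depend entirely on the critic; the terminal reward only enters through the final delta, discounted back by $\gamma\lambda$ per step.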
