
The GRPO algorithm (simplified by removing clipping) defines the following objective: $$ \dfrac{1}{G} \sum_{i=1}^{G} \dfrac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left(A_{i,t} - \beta\, \mathrm{KL}\right) $$ with the advantage $A_{i,t}$ calculated as: $$A_{i,t} = \dfrac{r_i - \mathrm{mean}(r)}{\mathrm{std}(r)}$$ If $\beta$ is so small that the KL term's contribution is negligible, summing the advantages makes the total objective close to zero. My question is: when this happens (which seems likely for certain choices of hyperparameters), does the model (or policy) not learn much?
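For concreteness, here is a minimal sketch (with made-up rewards for a group of $G = 4$ completions, assumed to have equal length so the $1/|o_i|$ weights are identical) showing the group-normalized advantages summing to roughly zero:

```python
import numpy as np

# Made-up rewards for a group of G = 4 sampled completions.
r = np.array([1.0, 0.0, 0.5, 0.25])

# Group-relative advantage: A_i = (r_i - mean(r)) / std(r)
A = (r - r.mean()) / r.std()

print(A)        # roughly [ 1.52, -1.18,  0.17, -0.51]
print(A.sum())  # ~0 up to floating-point error, since the group mean was subtracted
```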

Nitin

1 Answer


Indeed, according to the OP's objective, if the KL term is negligible then the aggregate learning signal can be very small, meaning the gradient updates may be minuscule, which in turn could lead to slow or stalled learning. However, subtracting a baseline from the return or normalizing the advantage is standard in policy-gradient methods: it reduces training variance by focusing on relative learning signals rather than absolute ones, as in REINFORCE with a baseline or the normalized GAE used in PPO.
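To see why a relative signal still drives learning even when the advantages sum to (near) zero, here is a minimal NumPy sketch with a toy softmax policy and made-up rewards (purely illustrative, not an actual GRPO implementation): each advantage weights the score function of a different sampled action, so the per-sample terms do not cancel.

```python
import numpy as np

# Toy categorical policy over 3 actions, parameterized by logits theta.
theta = np.zeros(3)
probs = np.exp(theta) / np.exp(theta).sum()

# A "group" of sampled actions and their made-up rewards.
actions = np.array([0, 1, 2, 0])
rewards = np.array([1.0, 0.0, 0.5, 0.25])

# Group-relative advantages: they sum to ~0 by construction.
adv = (rewards - rewards.mean()) / rewards.std()

# REINFORCE-style estimate of grad_theta of sum_i adv_i * log pi(a_i);
# for a softmax policy, grad log pi(a) = onehot(a) - probs.
grad = np.zeros(3)
for a, A in zip(actions, adv):
    grad += A * (np.eye(3)[a] - probs)

print(adv.sum())  # ~0
print(grad)       # non-zero: the relative signal still moves the policy
```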

However, in reality the GRPO objective is not simply the formulation above; it is a more subtle clipped objective that dispenses with PPO's critic entirely by measuring rewards relative to the group within each batch, and it averages across trajectories and over timesteps $t$. You can refer to its Wikipedia article or to equation (3) of your own reference. So the model is very unlikely to stop learning before convergence is reached, while the clipping still ensures training stability.

The Group Relative Policy Optimization (GRPO) is a minor variant of PPO that omits the value function estimator $V$... it maximizes the PPO objective, averaged over all actions: $$\max_{\theta} \frac{1}{G}\sum_{i=1}^{G}\mathbb{E}_{(s, a_1, \dots, a_G)\sim\pi_{\theta_t}}\left[\begin{cases}\min\left(\dfrac{\pi_{\theta}(a_i|s)}{\pi_{\theta_t}(a_i|s)},\, 1+\epsilon\right)A^{\pi_{\theta_t}}(s,a_i) & \text{if } A^{\pi_{\theta_t}}(s,a_i) > 0\\[6pt]\max\left(\dfrac{\pi_{\theta}(a_i|s)}{\pi_{\theta_t}(a_i|s)},\, 1-\epsilon\right)A^{\pi_{\theta_t}}(s,a_i) & \text{if } A^{\pi_{\theta_t}}(s,a_i) < 0\end{cases}\right]$$ Intuitively, each policy update step in GRPO makes the policy more likely to respond to each state with an action that performed relatively better than other actions tried at that state, and less likely to respond with one that performed relatively worse.
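For intuition about how the clipping caps each update, here is a minimal NumPy sketch of that per-sample clipped term written in the equivalent min/clip form, with made-up probability ratios and advantages ($\epsilon$ is the clipping hyperparameter):

```python
import numpy as np

def clipped_term(ratio, adv, eps=0.2):
    """PPO/GRPO-style clipped surrogate for each sample.

    Equivalent to the cases above: the ratio's effect is capped at 1 + eps
    when the advantage is positive and at 1 - eps when it is negative.
    """
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

# Made-up ratios pi_theta(a_i|s) / pi_theta_t(a_i|s) and normalized advantages.
ratio = np.array([1.5, 0.6, 1.1, 0.9])
adv   = np.array([1.52, -1.18, 0.17, -0.51])

per_sample = clipped_term(ratio, adv)
objective = per_sample.mean()  # averaged over the group, as in the quoted objective
print(per_sample, objective)
```

Because the ratio's influence is bounded on both sides, no single sample can drive an arbitrarily large update, which is the stability property mentioned above.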

cinch