
In the paper, they write:

$$\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\!\left[ r_\theta(x,y) \;-\; \beta\,\log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]$$

Now, is $y$ the full response or only the next token? On the one hand, the reward model expects a full response; on the other hand, they write 'per-token KL penalty'. So do we sample the next token, or the next full response?

Secondly, is the equation correct? It looks like they only optimize for a lower KL divergence, so where does the main optimization take place? (The reward model is fixed in this equation.)

Nathan G

1 Answer


About the first point, it is most likely that the network receives the full completion, as you say, in order to estimate the reward (you could feed just $[x, \text{next token}]$, but the reward model would have a hard time assigning a meaningful reward to a single token).
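
To make the "full response" reading concrete, here is a minimal, self-contained sketch with toy stand-in modules (not the paper's code): the policy samples an entire completion $y$ for the prompt $x$, and the reward model maps the whole $(x, y)$ sequence to a single scalar.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 16

class TinyPolicy(nn.Module):
    """Toy stand-in for the autoregressive policy: emits logits over the vocab."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)
    def forward(self, ids):                 # ids: (batch, seq)
        return self.head(self.emb(ids))     # (batch, seq, vocab)

class TinyRewardModel(nn.Module):
    """Toy stand-in for the reward model: full token sequence -> one scalar."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)
    def forward(self, ids):
        return self.score(self.emb(ids).mean(dim=1)).squeeze(-1)  # (batch,)

policy, reward_model = TinyPolicy(), TinyRewardModel()
x = torch.randint(0, vocab_size, (1, 8))    # prompt tokens

# Sample the *full* completion token by token.
ids = x.clone()
for _ in range(12):
    logits = policy(ids)[:, -1, :]
    next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1)
    ids = torch.cat([ids, next_tok], dim=1)

# One scalar reward for the whole (prompt, completion) pair.
r = reward_model(ids)
print("reward for the full completion:", r.item())
```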

Regarding the second part, yes, $r_\theta(x,y)$ is fixed (its parameters are not updated here), but the expectation is over $(x,y)$ pairs sampled from $\pi_\phi^{RL}$, which changes over training. Therefore the model has to find a way to maximize the reward model's score without moving too far from the SFT model.
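
A small numeric sketch of how that plays out for one sample (assumed shapes and a hypothetical reward value, not the paper's implementation): the reward is a single scalar for the full completion, the KL penalty is accumulated per sampled token, and only $\pi_\phi^{RL}$ would receive gradients.

```python
import torch

torch.manual_seed(0)
beta = 0.02
T, vocab = 12, 100                                 # completion length, vocab size

y = torch.randint(0, vocab, (T,))                  # sampled completion tokens
rl_logits  = torch.randn(T, vocab)                 # stand-in for pi_phi^RL logits
sft_logits = torch.randn(T, vocab)                 # stand-in for frozen pi^SFT logits

# Log-probabilities of the sampled tokens under each model.
logp_rl  = torch.log_softmax(rl_logits,  dim=-1)[torch.arange(T), y]
logp_sft = torch.log_softmax(sft_logits, dim=-1)[torch.arange(T), y]

per_token_kl = logp_rl - logp_sft                  # log-ratio for each sampled token
r_theta = torch.tensor(1.3)                        # hypothetical scalar from the fixed reward model

# Per-sample objective: scalar reward minus the summed per-token KL penalty.
objective = r_theta - beta * per_token_kl.sum()
print("per-token penalties:", (beta * per_token_kl).tolist())
print("objective for this sample:", objective.item())
```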

Alberto