
I'm having trouble understanding this paragraph of the DPO paper:

[screenshot of the relevant paragraph from the DPO paper]

Why does it matter so much that the preference data distribution aligns with the reference model's output distribution? My understanding is that during training, the parameters of the SFT (supervised fine-tuned) model are updated so that chosen responses ($y_w$) become more likely to be generated and rejected responses ($y_l$) become less likely, while the reference model is just there to keep the SFT model from straying too far from its original parameters. But I fail to see how the wrong reference distribution could hinder this process. Could someone please help me?
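For reference, the DPO objective from the paper is

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \vert x)}{\pi_{\text{ref}}(y_w \vert x)} - \beta \log \frac{\pi_\theta(y_l \vert x)}{\pi_{\text{ref}}(y_l \vert x)}\right)\right],$$

where $\pi_\theta$ is the policy being fine-tuned and $\pi_{\text{ref}}$ is the frozen reference (SFT) model.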

Ivy Cao

1 Answer


I ran into this problem myself, so I think I can shed some light on your question.

We had a dataset of preferences generated from a really bad model, so naturally both $y_w$ and $y_l$ were bad responses. If we were to use such preferences to train an already good SFT model, we would be looking at $\pi(y_w\vert x)$ and $\pi(y_l\vert x)$ under our good model, which are probably extremely small numbers!
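To make this concrete, here is a minimal sketch of the DPO loss for a single preference pair, in plain Python with made-up log-probability values. The loss only sees the log-ratio margin between $y_w$ and $y_l$, so if both responses are essentially impossible under your policy, DPO still happily optimizes their relative ordering without ever making the model likely to produce either of them:

```python
import math

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# In-distribution pair: the policy assigns non-negligible probability to both
# responses (all log-prob values here are made up, purely for illustration).
print(dpo_pair_loss(logp_w=-20.0, logp_l=-25.0, ref_logp_w=-22.0, ref_logp_l=-22.0))

# Off-distribution pair: both responses are astronomically unlikely under the
# policy and the reference, yet the loss is exactly the same, because DPO only
# cares about the relative margin, not whether the model would ever actually
# generate either response.
print(dpo_pair_loss(logp_w=-300.0, logp_l=-305.0, ref_logp_w=-302.0, ref_logp_l=-302.0))
```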

In a sense, we are telling the model "you should prefer to generate this bad answer rather than this atrocious one". This isn't helpful, and in my opinion will probably make the model worse.

This can also happen if the model that the DPO data comes from is radically different from the one we're fine-tuning. If they produce really different text, $\pi(y_w\vert x)$ will again be extremely small, and we'd be comparing sentences that our network is highly unlikely to produce.
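One practical sanity check is to score the preference responses under your reference/SFT model before running DPO. Below is a rough sketch using Hugging Face `transformers`; `gpt2` is just a stand-in for your reference model, and the prompt/response tokenization boundary is handled naively:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # stand-in for your reference / SFT model
tok = AutoTokenizer.from_pretrained(ref_name)
ref_model = AutoModelForCausalLM.from_pretrained(ref_name)
ref_model.eval()

@torch.no_grad()
def avg_response_logprob(prompt: str, response: str) -> float:
    """Average per-token log-probability of `response` given `prompt` under the reference model."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = ref_model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # prediction for the next token
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_logprobs[:, n_prompt - 1:].mean().item()  # keep only the response tokens

# If the chosen/rejected responses in your preference set score far lower than
# text the reference model produces itself, the data is off-distribution.
print(avg_response_logprob("Question: What is the capital of France?\nAnswer:", " Paris."))
```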

Because of this, I think it is highly desirable for the preference responses to come from the same distribution as the model you are fine-tuning, or at least a "similar" one. Hope this clarifies things for you.