I'm having trouble understanding this paragraph of the DPO paper:
Why does it matter so much that the preference data distribution aligns with the reference model's output distribution? My understanding is that during training, the parameters of the SFT (supervised fine-tuned) model are updated so that chosen responses ($y_w$) become more likely to be generated and rejected responses ($y_l$) become less likely, while the reference model is only there to keep the policy from straying too far from its original parameters. But I fail to see how a mismatched reference distribution could hinder this process. Could someone please help me understand?
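
For context, this is the DPO objective from the paper, which is where the reference model enters (notation as in the paper):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

And here is a minimal PyTorch-style sketch of how I understand the loss is computed (the function and argument names are mine, not from any library; the inputs are per-example response log-probabilities):

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    # -log sigma(beta * (chosen log-ratio minus rejected log-ratio))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

As far as I can tell, $\pi_{\mathrm{ref}}$ only appears through these log-ratios, which is exactly why I don't see where a mismatch with the preference data distribution would hurt.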
