Questions tagged [rlhf]

For questions related to Reinforcement Learning from Human Feedback (RLHF).

8 questions
4 votes • 1 answer

Why do we need RL in RLHF?

In RLHF, the reward function is a neural network. This means we can compute its gradients cheaply and accurately through backpropagation. Now, we want to find a policy that maximizes reward (see https://arxiv.org/abs/2203.02155). Then, why do we…
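As a quick reference for this question: the objective optimized in the linked paper (and in most RLHF setups) is, in the usual notation (not quoted from the question itself), a KL-regularized expected reward over completions sampled from the policy: $$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]$$ Because the expectation is taken over sequences sampled from $\pi_\theta$ itself and the reward is assigned to whole generations, backpropagating through $r_\phi$ alone does not give gradients with respect to $\theta$; this sampling step is the usual motivation for policy-gradient methods such as PPO.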
3 votes • 2 answers

What is the difference between fine-tuning and RLHF for LLMs?

I am confused about the difference between fine-tuning and RLHF for LLMs. When should I use which? I know RLHF needs a reward model, which first rates responses to align them with human preferences, and afterwards uses this reward…
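As background for the reward-model step mentioned here (standard formulation, with notation chosen for illustration): the reward model $r_\phi$ is typically fit on human preference pairs with a Bradley–Terry style loss, $$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big],$$ where $y_w$ is the preferred response and $y_l$ the rejected one; the resulting $r_\phi$ is then used to optimize the policy, whereas plain supervised fine-tuning only imitates reference completions.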
2 votes • 1 answer

What is the benefit of group-relative advantages in GRPO when their sum is close to zero?

The GRPO algorithm (simplified by removing clipping) defines the following objective: $$ \dfrac{1}{G} \sum_{i=1}^{G} \dfrac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left(A_{i,t} - \beta\, \mathrm{KL}\right) $$ with the advantage $A_{i,t}$ calculated as: $$A_{i,t} = \dfrac{r_i -…
Nitin • 23 • 3
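For readers of this question, the truncated advantage definition follows the standard GRPO formulation (reproduced here in the usual notation, not from the question text): $$A_{i,t} = \dfrac{r_i - \mathrm{mean}\big(\{r_1, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, \ldots, r_G\}\big)}$$ Every token of output $o_i$ shares the same group-normalized advantage, so the advantages within a group do indeed sum to approximately zero, which is the situation the question asks about.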
1 vote • 0 answers

Which framework should I use for training transformer language models with reinforcement learning?

Which framework should I use for training transformer language models with reinforcement learning (e.g., GRPO)? Any recommendations? I am comparing trl (Hugging Face), unsloth, verl (Volcano Engine), and openrlhf on features such as their role in GRPO: full GRPO framework, implements…
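As a concrete point of comparison for this question, a minimal GRPO training run with trl might look like the sketch below; it assumes a recent trl release that ships GRPOTrainer/GRPOConfig, and the model name, dataset, and reward function are placeholders chosen only for illustration.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt dataset works; this public example dataset is a placeholder.
train_dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(200 - len(c)) for c in completions]

args = GRPOConfig(output_dir="grpo-demo", logging_steps=10)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",   # placeholder base model
    reward_funcs=reward_len,            # one or more reward functions
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```

The other frameworks in the list expose GRPO through their own trainer entry points and configs, so this sketch only illustrates the trl side of the comparison.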
1 vote • 1 answer

Understanding the distribution shift problem in direct preference optimization (DPO)

I'm having trouble understanding this paragraph of the DPO paper: Why does it matter so much that the preference data distribution aligns with the reference model output distribution? My understanding is that during training, the parameters of the…
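For reference while reading this question, the DPO objective under discussion is (standard form from the DPO paper, restated here): $$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$ The implicit reward is defined relative to $\pi_{\mathrm{ref}}$, which is one way to see why the paper cares that the preference data is drawn from (or near) the reference model's output distribution.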
1 vote • 0 answers

While reading about RLHF, I came across this formula but can't break it down

What is the exact meaning of this expression? I'm unsure about the notation. I believe $E[R(s)]$ is the expected value of the reward of state $s$, but I'm unsure what the subscript under the $E$ means.
Ryan Marr • 11 • 2
1 vote • 0 answers

Negative KL-divergence RLHF implementation

I am struggling to understand one part of the FAQ of the transformer reinforcement learning (TRL) library from Hugging Face: What Is the Concern with Negative KL Divergence? If you generate text by purely sampling from the model distribution, things work…
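For context on this question, TRL-style implementations typically penalize a per-token, single-sample estimate of the KL term, computed as the difference of log-probabilities between the trained model and the frozen reference model. The sketch below uses made-up numbers (not TRL code) just to show why individual estimates, and even whole sequences produced with constrained decoding, can come out negative even though the true KL divergence is non-negative.

```python
import torch

# Hypothetical log-probabilities of the *sampled* tokens under the trained
# policy and under the frozen reference model (illustrative values only).
logprobs_policy = torch.tensor([-1.2, -0.3, -2.1, -0.8])
logprobs_ref = torch.tensor([-1.0, -0.5, -1.9, -0.4])

# Common single-sample estimator used as a per-token KL penalty:
#   log pi_theta(token | context) - log pi_ref(token | context)
kl_per_token = logprobs_policy - logprobs_ref

# The estimator is non-negative only *in expectation* when tokens are drawn
# by pure sampling from pi_theta; decoding tricks (forced minimum length,
# banned tokens, etc.) break that guarantee, so per-token and even summed
# estimates can be negative, which the reward then (wrongly) encourages.
print(kl_per_token)        # mixture of positive and negative entries
print(kl_per_token.sum())  # negative for these illustrative numbers
```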
0 votes • 1 answer

Can we apply soft actor-critic to language model preference optimization?

Most language models are aligned with online PPO or offline DPO-type algorithms. Can we use soft actor-critic RL for alignment work? Can anyone recommend related publications?