
In a nutshell, my rough understanding of RLHF as used for ChatGPT is this:

  1. A reward model is trained using comparisons of different responses to the same prompt. Human trainers rank these responses based on quality.

  2. The reward model is a neural network that learns to predict these human rankings. It essentially learns the preferences (ranking criteria) that the human trainers apply to responses.

  3. An initial policy, which is a language model, is fine-tuned using Proximal Policy Optimization (PPO) with the reward model providing the reward signal. This process is iterative, with the policy and reward model being updated alternately.

  4. The policy is then used to generate responses to prompts. The reward model assesses these responses and provides a reward signal, which is used to further fine-tune the policy, i.e. the language model (a toy sketch of this loop follows the list).
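
To make sure I have the mechanics right, here is how I picture steps 1–4 in toy code, assuming a PyTorch setup; the functions, tensors, and coefficient values are illustrative stand-ins rather than OpenAI's actual implementation:

```python
# Toy sketch of the two RLHF ingredients from steps 1-4 above, assuming a
# PyTorch setup. The tensors and coefficients are made up; only the
# loss/reward formulas are the point, not OpenAI's actual implementation.
import torch
import torch.nn.functional as F


# Steps 1-2: the reward model is trained on human comparisons so that the
# preferred response gets the higher score. A common pairwise ranking loss
# is -log sigmoid(r_chosen - r_rejected).
def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Steps 3-4: during PPO fine-tuning, the scalar reward for a sampled
# completion is typically the reward model's score minus a KL penalty that
# keeps the fine-tuned policy close to the original language model.
def rl_reward(rm_score: torch.Tensor,
              logprob_policy: torch.Tensor,
              logprob_reference: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:  # illustrative coefficient
    kl_estimate = logprob_policy - logprob_reference  # per-sequence log-ratio
    return rm_score - kl_coef * kl_estimate


if __name__ == "__main__":
    # Made-up scores for two comparison pairs:
    print(reward_model_loss(torch.tensor([1.3, 0.7]), torch.tensor([0.2, 0.9])).item())
    # Made-up reward-model score and log-probabilities for one completion:
    print(rl_reward(torch.tensor([0.8]), torch.tensor([-42.0]), torch.tensor([-45.0])))
```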

My main question is the first one; the others just give context:

1. What's the architecture and size of the neural-network-based reward model?

  1. Is it pretrained, too? Is it possibly another pretrained (foundational) language model?

  2. On how many samples labelled by human trainers is the reward model trained?

  3. On how many prompts and rewarded completions is the language model fine-tuned later? (And which prompts, by the way?)

I'd like to compare these numbers with those of the pretrained ChatGPT model:

  • Transformer-based ChatGPT has 175 billion weights (i.e. parameters).

  • It was pretrained on 500 GB of text data, distributed over an unknown number of "documents" (from single tweets to the Holy Bible), with roughly 500B tokens overall. During training, ChatGPT was exposed to a multiple of these 500B tokens (assuming that all 500B tokens were used for training).

I assume that during RLHF the foundational ChatGPT model was exposed to a much smaller number of prompts to complete (and be rewarded on).

Hans-Peter Stricker

1 Answer


If you haven't already, I would recommend a careful reading of OpenAI's paper on InstructGPT. This was their publication from last year regarding how they applied RLHF to GPT-3, the precursor of ChatGPT.

The appendix provides information on the reward model and the RLHF training data. For example,

For the reward models and value functions, the unembedding layer of the original model is replaced with a projection layer to output a scalar value.

The final reward model was initialized from a 6B GPT-3 model that was fine-tuned on a variety of public NLP datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, and Winogrande).

and,

We train all the RL models for 256k episodes. These episodes include about 31k unique prompts, after filtering out prompts with PII and deduplication based on common prefixes.
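
To make the first quote a bit more concrete, here is a rough sketch of what such a scalar "value head" on top of a pretrained transformer can look like, assuming PyTorch and Hugging Face transformers; the small public gpt2 model stands in for the 6B GPT-3 backbone (which is not publicly available), and the class itself is my own illustration, not OpenAI's code:

```python
# Rough sketch of a reward model head, assuming PyTorch + Hugging Face
# transformers. "gpt2" is only a stand-in for the (non-public) 6B GPT-3.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer


class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        # Pretrained transformer body (embeddings + blocks) *without* the
        # usual unembedding / LM head that maps hidden states to vocab logits.
        self.backbone = GPT2Model.from_pretrained(backbone_name)
        # The "projection layer to output a scalar value" from the quote:
        self.value_head = nn.Linear(self.backbone.config.n_embd, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # one scalar reward per sequence


if __name__ == "__main__":
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    reward_model = RewardModel()
    batch = tokenizer(["Prompt: Hi!\nResponse: Hello, how can I help?"],
                      return_tensors="pt", padding=True)
    print(reward_model(batch["input_ids"], batch["attention_mask"]))
```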

If you want to know what ChatGPT does specifically, you might have to ask someone who works there. It's not public information.

Venna Banana