Reward Model

Read some of the latest posts from Cameron’s blog and got a knowledge refresh on reward models, DPO (which will be my next blog), and a review of PPO (as always).

0 Overview of how the RM is used

Standard LLM post-training steps in RLHF:

  1. Supervised finetuning (SFT), a.k.a. instruction finetuning (IFT), trains the model using next-token prediction over examples of good completions.
  2. A reward model (RM) is trained over a human preference dataset.
  3. RL is used to finetune the LLM, using the output of the RM as a training signal.

1 Bradley-Terry Model of Preference

The RM is actually based on the Bradley-Terry (BT) model, which I first learned about from Elo score calculation. The BT model gives the probability that completion $y_w$ is preferred over $y_l$ for a prompt $x$:

$$P(y_w \succ y_l \mid x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))}$$

Using the reward model $r_\theta(x, y)$ as the score function leads to the standard training loss over a preference dataset $\mathcal{D}$:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

where $\sigma$ is the sigmoid function.
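
As a minimal sketch of this loss (assuming a PyTorch-style setup where the RM has already produced scalar rewards for the chosen and rejected completions in a batch):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise RM loss: -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar rewards the RM assigned to three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.1])    # rewards for the preferred completions
rejected = torch.tensor([0.4, 0.9, 1.0])  # rewards for the rejected completions
print(bradley_terry_loss(chosen, rejected).item())
```

The loss only depends on the gap between the two rewards, which is exactly what the BT model predicts preferences from.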

2 Different Types of RMs

  1. Train a custom classifier to serve as an RM
  2. LLM-as-a-judge
  3. Outcome Reward Models (ORMs): unlike a standard RM that predicts the reward at a sequence level, the ORM predicts correctness on a per-token basis (see the sketch after this list).
  4. Process Reward Models (PRMs): PRMs make predictions after every step of the reasoning process rather than after every token. Collecting training data for PRMs is difficult, as they require granular supervision.
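
To make the sequence-level vs. per-token distinction concrete, here is a rough sketch (the module names are mine, assuming a transformer backbone that returns per-token hidden states of size hidden_dim):

```python
import torch
import torch.nn as nn

class SequenceRewardHead(nn.Module):
    """Standard RM head: one scalar reward for the whole sequence."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        # Score the full sequence from the final token's hidden state.
        return self.score(hidden_states[:, -1, :]).squeeze(-1)  # (batch,)

class OutcomeRewardHead(nn.Module):
    """ORM-style head: a correctness prediction at every token position."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # One probability of correctness per token.
        return torch.sigmoid(self.score(hidden_states)).squeeze(-1)  # (batch, seq_len)

hidden = torch.randn(2, 16, 768)  # toy batch of per-token hidden states
print(SequenceRewardHead(768)(hidden).shape)  # torch.Size([2])
print(OutcomeRewardHead(768)(hidden).shape)   # torch.Size([2, 16])
```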

3 RLVR (Reinforcement Learning with Verifiable Rewards)

Two criteria for a verifiable result (a minimal rule-based check is sketched after the list):

  1. It has a ground-truth answer.
  2. There is a rule-based technique that can be used to verify correctness.
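
A minimal sketch of a verifiable reward for a math-style task, assuming the final answer is wrapped in \boxed{...} and the ground-truth answer is a known string (the function names are just for illustration):

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer, assuming it is wrapped in \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 for an exact match with the reference answer, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

print(verifiable_reward(r"The area is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"The area is \boxed{41}", "42"))  # 0.0
```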
