Reward Model
Read some of the latest posts from Cameron’s blog and got a knowledge refresh on reward models, DPO (which will be my next blog post), and a review of PPO (as always).
0 Overview of how the RM is used
Standard LLM post-training steps in RLHF:
- Supervised finetuning (SFT), a.k.a. instruction finetuning (IFT), trains the model using next-token prediction over examples of good completions.
- A reward model (RM) is trained over a human preference dataset.
- RL is used to finetune the LLM, with the RM's output serving as the training signal (see the sketch below).
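To make the last step concrete, here is a minimal sketch of how the RM score typically enters the RL objective: the RM scores each sampled completion, and a KL penalty keeps the policy close to the SFT reference model. The functions and the coefficient `beta` are placeholders I made up for illustration, not any specific library's API.

```python
# Sketch of the RLHF reward signal (placeholder functions, toy reward model).

def rm_score(prompt: str, completion: str) -> float:
    """Stand-in reward model: prefers longer, polite completions."""
    return 0.1 * len(completion.split()) + (1.0 if "please" in completion else 0.0)

def kl_penalty(logprob_policy: float, logprob_sft: float) -> float:
    """Per-sequence KL estimate between the current policy and the SFT model."""
    return logprob_policy - logprob_sft

def rl_reward(prompt, completion, logprob_policy, logprob_sft, beta=0.1):
    # Reward used during RL finetuning (e.g., PPO): RM score minus a KL
    # penalty that discourages drifting too far from the SFT reference model.
    return rm_score(prompt, completion) - beta * kl_penalty(logprob_policy, logprob_sft)

print(rl_reward("Summarize:", "Here is a short summary, please enjoy.", -12.3, -13.1))
```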
1 Bradley-Terry Model of Preference
The RM is actually based on the Bradley-Terry (BT) model, which I first learned about when studying Elo score calculation.
The reward function takes the place of each completion's latent score in the BT preference probability, and maximizing the likelihood of the human preference data leads to the standard RM training loss. Both are written out below.
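Writing these out (the notation $r_\phi$, $y_w$, $y_l$ is mine, but the formulas are the standard ones): $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected completions, and $\sigma$ is the sigmoid.

$$
P(y_w \succ y_l \mid x) = \frac{\exp\!\big(r_\phi(x, y_w)\big)}{\exp\!\big(r_\phi(x, y_w)\big) + \exp\!\big(r_\phi(x, y_l)\big)} = \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)
$$

$$
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$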
2 Different Types of RMs
- Train a custom classifier to serve as an RM (see the training sketch after this list)
- LLM-as-a-judge
- Outcome Reward Models (ORMs): unlike a standard RM that predicts a reward at the sequence level, the ORM predicts correctness on a per-token basis
- Process Reward Models (PRMs): PRMs make predictions after every step of the reasoning process rather than after every token. Collecting training data for PRMs is difficult, as it requires granular, step-level supervision.
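A minimal sketch of the custom-classifier route, using the pairwise BT loss from section 1. I'm assuming a Hugging Face backbone with a single-logit head; the model name and the toy preference example are placeholders, not a recommendation.

```python
# Sketch: custom classifier RM trained with the pairwise Bradley-Terry loss.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(name)
rm = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)  # scalar reward head
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-5)

def reward(prompt: str, completion: str) -> torch.Tensor:
    """Score a (prompt, completion) pair with the scalar head."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    return rm(**inputs).logits.squeeze(-1)  # shape: (1,)

# One toy preference example: (prompt, chosen, rejected).
prompt, chosen, rejected = "2+2=?", "4", "5"

r_chosen, r_rejected = reward(prompt, chosen), reward(prompt, rejected)
# Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.4f}")
```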
3 RLVR (Reinforcement Learning with Verifiable Rewards)
Two criteria for a verifiable result (toy verifier sketch below):
- has ground truth answer
- has a rule-based technique that can be used to verify correctness
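A toy example of what rule-based verification can look like for a math problem: the reward is 1 if the extracted final answer matches the ground truth and 0 otherwise. The "last number in the completion" convention is my own placeholder, not from any particular RLVR paper.

```python
import re

def verify(completion: str, ground_truth: str) -> float:
    """Rule-based verifiable reward: 1.0 if the extracted final answer
    matches the ground truth exactly, else 0.0."""
    # Placeholder convention: treat the last number in the completion as the answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verify("First add 2 and 2 to get 4, so the answer is 4", "4"))  # 1.0
print(verify("The answer is 5", "4"))                                 # 0.0
```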