DPO - Direct Preference Optimization


Continuing with another blog from Cameron, which gives a great explanation and super easy to digest math details.

1 Preference dataset

A preference dataset (PD) consists of prompts, each with an associated "chosen" and "rejected" response. These pairwise preferences can be integrated into the Bradley-Terry model, shown below.
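For reference, the Bradley-Terry model expresses the probability that the chosen response $y_w$ is preferred over the rejected response $y_l$ for a prompt $x$, given some latent reward $r$:

$$
p(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)} = \sigma\big(r(x, y_w) - r(x, y_l)\big)
$$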

2 DPO in LLM Training

DPO replaces RLHF in the LLM training workflow. The standard RLHF setup has a reward model within its objective function:

  1. It uses an explicit reward model.
  2. It uses PPO to train the policy with RL (the objective is written out after this list).

But with DPO, we use:

  1. an implicit reward model within the policy itself, and
  2. an indirect derivation of the optimal policy from that implicit reward.
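For reference, the KL-constrained RLHF objective that DPO starts from is the standard one, where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference (SFT) model, $r_\phi$ is the reward model, and $\beta$ controls the KL penalty:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \big]
$$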

Here is the loss function for DPO, reconstructed below. It is actually very close to the loss function we used to train the reward model.
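Written out, the DPO loss is a logistic loss over the difference of the implicit rewards of the chosen and rejected responses:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Below is a minimal PyTorch-style sketch of this loss, assuming we have already computed the summed token log-probabilities of each response under the policy and the reference model; the function and argument names are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-response log-probs (one value per example)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```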

3 Derivation of DPO

  1. Deriving an expression for the optimal policy in RLHF. This step defines a partition function and the optimal policy (both are written out after this list). You can see that
    • The value of the optimal policy is ≥ 0 for all possible completions y.
    • The sum of the optimal policy across all completions y is equal to 1. Further derivation shows that this policy is the one that optimizes the RLHF objective.
  2. Deriving an implicit reward. Re-arranging the definition of the optimal policy, we get the implicit reward function.
  3. Plugging the implicit reward into the Bradley-Terry model.
  4. Training an LLM to match this implicit preference model; this is what we are doing in the DPO training process.
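For completeness, the key quantities from the derivation sketched above, in standard DPO notation, are the partition function, the optimal policy, and the implicit reward obtained by re-arranging the optimal policy:

$$
Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\left( \tfrac{1}{\beta}\, r(x, y) \right)
$$

$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left( \tfrac{1}{\beta}\, r(x, y) \right)
$$

$$
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
$$

When this implicit reward is plugged into the Bradley-Terry model, the intractable $\beta \log Z(x)$ term cancels in the reward difference, which is exactly what makes the DPO loss above tractable.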
