DPO - Direct Preference Optimization
Continuing with another blog from Cameron, which has a great explanation and super easy-to-digest math details.
1 Preference dataset
A preference dataset consists of prompts, each with an associated “chosen” and “rejected” response.
These pairwise preferences can be modeled with the Bradley-Terry model.
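Under the Bradley-Terry model, the probability that the chosen response y_w is preferred over the rejected response y_l for a prompt x is expressed through a latent reward r(x, y):

$$
p(y_w \succ y_l \mid x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma\big(r(x, y_w) - r(x, y_l)\big)
$$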
2 DPO in LLM Training
DPO replaces RLHF in the LLM training workflow.
Standard RLHF has a reward model within its objective function:
- It uses an explicit reward model
- It uses PPO to train the policy with RL
But with DPO, we use:
- an implicit reward model within the policy itself
- a closed-form derivation that indirectly obtains the optimal policy from the preference data, with no RL loop
Here is the loss function for DPO:
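$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where π_θ is the policy being trained, π_ref is the frozen reference (e.g., SFT) model, and β controls the strength of the KL constraint.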
This is actually very close to the loss function used to train the reward model in RLHF, just with the implicit reward β log(π_θ(y|x)/π_ref(y|x)) in place of an explicit reward model's output.
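A minimal PyTorch sketch of this loss (assuming we have already computed the summed token log-probs of each response under both the policy and a frozen reference model; the function and argument names here are just illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of summed log-probs log pi(y|x)
    for the chosen / rejected responses."""
    # Implicit rewards: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigma(reward margin), averaged over the batch; structurally the
    # same as the reward model's ranking loss, but with implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```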
3 Derivation of DPO
- Deriving an expression for the optimal policy in RLHF.
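The starting point is the standard KL-constrained RLHF objective:

$$
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\big]
$$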
This step defines a partition function Z(x) and uses it to express the optimal policy, as below:
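$$
Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)
$$

$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)
$$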
You can see that:
- The value of the optimal policy is ≥ 0 for all possible completions y.
- The sum of the optimal policy across all completions y is equal to 1.
Together, these confirm that π* is a valid probability distribution.
Further derivation shows that this policy minimizes the (rewritten) RLHF objective, i.e., it is indeed the optimal policy.
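The trick is to rewrite the RLHF objective in terms of π*; up to terms that do not depend on π, it becomes a KL-divergence minimization:

$$
\min_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{D}_{\text{KL}}\big[\pi(y \mid x)\,\|\,\pi^*(y \mid x)\big] - \log Z(x)\Big]
$$

Since Z(x) does not depend on π, and a KL divergence is minimized (at zero) exactly when the two distributions match, the objective is minimized when π = π*.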
- Deriving an implicit reward. Rearranging the definition of the optimal policy, we get the implicit reward function:
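$$
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
$$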
- Plug into the Bradley-Terry model. Substituting this implicit reward, the intractable β log Z(x) term cancels, leaving a preference probability expressed purely in terms of policies:
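$$
p^*(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)
$$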
- Training an LLM to match this implicit preference model via maximum likelihood is exactly what the DPO training process does, and it yields the DPO loss shown above.