Reinforcement Learning - PPO

PPO (Proximal Policy Optimization) is one of the most powerful RL algorithms, and it is the default RL training algorithm at OpenAI.

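It all starts with the move from on-policy to off-policy learning. In on-policy learning, the agent being updated is the same one that interacts with the environment; in off-policy learning, the agent learns from data collected by a different policy. A scene from Hikaru no Go (棋魂) is a good example of the difference between the two.

[Figure: Hikaru no Go example of on-policy versus off-policy learning]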

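The advantage of moving to off-policy is clear: data sampled with the fixed policy $\pi_{\theta'}$ can be reused many times while you update the parameter $\theta$. This relies on a math trick called importance sampling, which is very straightforward: an expectation under a distribution $p$ can be estimated from samples of another distribution $q$ by reweighting each sample, $E_{x \sim p}[f(x)] = E_{x \sim q}\left[f(x)\frac{p(x)}{q(x)}\right]$.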
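The importance sampling trick can be checked numerically. The sketch below is only an illustration, not part of PPO itself: the two normal distributions, the helper names `normal_pdf` and `importance_sampling_mean`, and the sample size are all assumptions chosen for the example. It estimates the mean of a target distribution $p$ using only samples drawn from a different distribution $q$.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_mean(f, p_pdf, q_pdf, q_sampler, n, rng):
    """Estimate E_{x~p}[f(x)] using n samples drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sampler(rng)
        # Reweight each sample by the importance weight p(x) / q(x).
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n


rng = random.Random(0)

# Hypothetical setup: target p = N(1, 1), sampling distribution q = N(0, 1.5^2).
# The true value of E_{x~p}[x] is 1.
estimate = importance_sampling_mean(
    f=lambda x: x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 1.5),
    q_sampler=lambda r: r.gauss(0.0, 1.5),
    n=100_000,
    rng=rng,
)
print(estimate)
```

The estimate should land close to 1 even though no sample was drawn from $p$. If $q$ is moved far away from $p$, the weights become very uneven and the estimate gets noisy.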

Here is how the trick is used in the derivation, together with Bayes' rule, of course, to factor $p_\theta(s_t, a_t)$ into $p_\theta(a_t \mid s_t)\,p_\theta(s_t)$. The terms $p_\theta(s_t)$ and $p_{\theta'}(s_t)$ are cancelled out, NOT because they are equal, but because we cannot calculate the probability of visiting a given state. We then get the objective function $J^{\theta'}(\theta)$ by working backwards from the gradient.

  • $\theta$ is the parameter we are going to update
  • $\theta'$ is the fixed parameter we used to sample

[Figure: derivation of the objective function $J^{\theta'}(\theta)$]
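The derivation can be written out roughly as follows. This is a reconstruction in commonly used notation, not necessarily the exact form of the original figure: the advantage $A^{\theta}(s_t, a_t)$ and the expected return $\bar{R}_\theta$ are symbols assumed here. Replacing $A^{\theta}$ with $A^{\theta'}$ is itself an approximation, since the advantage is estimated from data sampled with $\theta'$.

$$
\begin{aligned}
\nabla \bar{R}_\theta
&= E_{(s_t, a_t) \sim \pi_\theta}\left[ A^{\theta}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t) \right] \\
&\approx E_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(s_t, a_t)}{p_{\theta'}(s_t, a_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t) \right] \\
&= E_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(a_t \mid s_t)\, p_\theta(s_t)}{p_{\theta'}(a_t \mid s_t)\, p_{\theta'}(s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t) \right] \\
&\approx E_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t) \right]
\end{aligned}
$$

Since $\nabla p_\theta = p_\theta \nabla \log p_\theta$, the last line is the gradient of

$$
J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \right]
$$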

So now you basically have the formula for PPO, with one condition: for importance sampling to work, the two distributions $p$ and $q$ need to be close. So we use a KL divergence term to regularize the objective function. TRPO is similar, but it imposes the KL divergence as a constraint instead of a penalty.

[Figure: PPO objective with KL penalty]
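In symbols, the two variants are usually written as below. This follows the common textbook form rather than the original figure; the penalty weight $\beta$ and the trust-region size $\delta$ are symbols assumed here, and the KL divergence is taken between the action distributions of the two policies, averaged over the sampled states.

$$
J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, E_{s_t}\left[ KL\left( p_{\theta'}(\cdot \mid s_t) \,\|\, p_\theta(\cdot \mid s_t) \right) \right]
$$

$$
J_{TRPO}^{\theta'}(\theta) = J^{\theta'}(\theta) \quad \text{subject to} \quad E_{s_t}\left[ KL\left( p_{\theta'}(\cdot \mid s_t) \,\|\, p_\theta(\cdot \mid s_t) \right) \right] < \delta
$$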

Here is the full PPO algorithm.

[Figure: full PPO algorithm]
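The two core pieces of the KL-penalty version can be sketched in a few lines. This is a minimal illustration under assumptions, not a full training loop: it assumes discrete action distributions given as plain lists, the function names and the sample format are hypothetical, and the rule for adjusting $\beta$ (double it when the KL is more than 1.5 times a target, halve it when the KL is below the target divided by 1.5) is the heuristic described in the PPO paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_surrogate(samples, beta):
    """Mean over samples of ratio * advantage - beta * KL(old || new).

    Each sample is (old_probs, new_probs, action, advantage), where the two
    probability lists are the action distributions of the sampling policy
    (theta') and the updated policy (theta) at the sampled state.
    """
    total = 0.0
    for old_probs, new_probs, action, advantage in samples:
        ratio = new_probs[action] / old_probs[action]
        total += ratio * advantage - beta * kl_divergence(old_probs, new_probs)
    return total / len(samples)


def adapt_beta(beta, mean_kl, kl_target):
    """Strengthen the penalty when the policies drift apart, relax it otherwise."""
    if mean_kl > 1.5 * kl_target:
        return beta * 2.0
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    return beta


# Hypothetical batch of two time steps with two possible actions each.
samples = [
    ([0.5, 0.5], [0.6, 0.4], 0, 1.0),
    ([0.5, 0.5], [0.6, 0.4], 1, -0.5),
]
objective = penalized_surrogate(samples, beta=1.0)
mean_kl = sum(kl_divergence(old, new) for old, new, _, _ in samples) / len(samples)
beta = adapt_beta(1.0, mean_kl, kl_target=0.01)
print(objective, mean_kl, beta)
```

In a real implementation the objective would be maximized with gradient ascent for several steps on the same batch before new data is sampled with the updated policy.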

And here is the improved PPO with clipping. (The red line is the return value of the clip function.)

[Figure: PPO clipped objective]
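The clipped objective takes, for each sample, the minimum of the unclipped term and the term with the probability ratio clipped to $[1 - \epsilon, 1 + \epsilon]$. The sketch below is a plain-Python illustration under assumptions: the function names are hypothetical, the ratios and advantages are passed in as precomputed lists, and $\epsilon = 0.2$ is just a commonly used default.

```python
def clip(x, low, high):
    """Limit x to the interval [low, high]."""
    return max(low, min(x, high))


def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples.

    ratios[i] is p_theta(a|s) / p_theta'(a|s) for sample i,
    advantages[i] is the advantage estimate for the same sample.
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        unclipped = ratio * advantage
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
        # Taking the minimum removes the incentive to push the ratio
        # outside the interval in the direction that improves the objective.
        total += min(unclipped, clipped)
    return total / len(ratios)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
print(clipped_surrogate([1.5], [1.0]))
# Negative advantage: the term stops changing once the ratio drops below 1 - epsilon.
print(clipped_surrogate([0.5], [-1.0]))
```

With a positive advantage the objective stops rewarding ratios above $1 + \epsilon$, and with a negative advantage it stops rewarding ratios below $1 - \epsilon$, so $\theta$ has no incentive to move far from $\theta'$.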


