RL 2024-3
This post focuses on PPO.
0 KL Divergence
KL divergence is closely related to entropy: $D_{KL}(P \,\|\, Q)$ equals the cross-entropy $H(P, Q)$ minus the entropy $H(P)$. It also appears as a regularization term in the loss function of VAEs.
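A minimal NumPy sketch of this relationship, using made-up categorical distributions (the names and numbers here are illustrative only):

```python
import numpy as np

# Toy example: D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)),
# which equals cross-entropy H(P, Q) minus entropy H(P).

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy_p     = -np.sum(p * np.log(p))   # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)

print(kl_divergence(p, q))               # ~0.085
print(cross_entropy - entropy_p)         # same value: H(P, Q) - H(P)
```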
1 Trust Region Policy Optimization (TRPO)
- Use a ratio between the old and updated policies as a surrogate objective
- Add a constraint based on the KL divergence between the two policies
- Solve a constrained maximization problem instead of plain gradient ascent
Compared to VPG, TRPO constrains updates in policy space rather than in parameter ($\theta$) space. Because small changes to $\theta$ can drastically alter the policy, keeping updates small in parameter space provides little guarantee that the resulting policy changes are small.
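A minimal NumPy sketch of these two ingredients (ratio-based surrogate plus a KL trust-region check) on made-up batch data; this is illustrative only, since real TRPO solves the constrained problem with a conjugate-gradient step and a line search:

```python
import numpy as np

# All numbers below are made up for a single batch of transitions.
advantages = np.array([0.5, -0.2, 1.3])    # advantage estimates A_t
logp_old   = np.array([-1.2, -0.8, -2.0])  # log pi_old(a_t | s_t), held fixed
logp_new   = np.array([-1.1, -0.9, -1.8])  # log pi_theta(a_t | s_t), current policy

# Surrogate objective: probability ratio times advantage, averaged over the batch.
ratio = np.exp(logp_new - logp_old)        # pi_theta / pi_old
surrogate = np.mean(ratio * advantages)

# Trust-region constraint: crude sample-based estimate of KL(pi_old || pi_theta).
approx_kl = np.mean(logp_old - logp_new)
delta = 0.01                               # trust-region size

print("surrogate:", surrogate)
print("approx KL:", approx_kl, "within trust region:", approx_kl <= delta)
```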
2 Proximal Policy Optimization (PPO)
Define the surrogate objective for TRPO as below; PPO's surrogate objective is a clipped version of it.
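For reference, the standard forms (as in Schulman et al., 2017; notation added here):

$$L^{TRPO}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big) \right]$$

A minimal NumPy sketch of the clipped objective on made-up batch data (the function name and inputs are illustrative, not from any particular library):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over a batch."""
    ratio = np.exp(logp_new - logp_old)                        # r_t(theta)
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) of the unclipped and clipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))

# Made-up batch: log-probs under the old and current policies, plus advantages.
logp_old   = np.array([-1.2, -0.8, -2.0])
logp_new   = np.array([-1.0, -0.9, -1.5])
advantages = np.array([0.5, -0.2, 1.3])

print(ppo_clip_objective(logp_new, logp_old, advantages))
```

The clipping removes the incentive to push $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$, which is how PPO approximates TRPO's trust region without an explicit KL constraint.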