RL in 2025
Happy $\pi$ Day! It's time to review RL in 2025. This Zhihu post gave me a much clearer view of value-based and policy-based methods, and I suspect writing a yearly review of RL also improves my own understanding of it.
1 $\pi$ vs $V_\pi$
RL can be summarized as two alternating steps:
- Policy evaluation: given a policy $\pi$, how to correctly evaluate its current value function $V_\pi$
- Policy improvement: given the current value function $V_\pi$, how to improve the policy $\pi$
We perform these two steps in turn until convergence, obtaining the optimal policy $\pi^*$ and its value function $V_{\pi^*}$.
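As a concrete illustration, here is a minimal sketch of tabular policy iteration. The transition-model interface (`P[s][a]` returning `(prob, next_state, reward)` tuples) is an assumption for the example, not something defined in this post.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Alternate policy evaluation and policy improvement until convergence.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples
    (a hypothetical Gym-style tabular transition model).
    """
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy
    V = np.zeros(n_states)

    while True:
        # Policy evaluation: Bellman expectation backups for the fixed policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break

        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:   # no action changed, so the greedy policy is optimal
            return policy, V
```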
Hence we can define value-based and policy-based methods accordingly, or actor-critic methods that use both together.
2 Policy based RL
Policy-based RL maximizes the following objective.
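In one standard formulation (assuming the usual trajectory-return setup), this objective is the expected discounted return of trajectories sampled from the policy:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t \, r(s_t, a_t)\right]
$$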
Going through the derivation and dropping the terms whose gradient with respect to the policy is zero (the terms that depend only on $s$), we get the following gradient.
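Using the $\Psi_t$ weighting that the list below refers to, this gradient is typically written as

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]
$$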
Here are the different variants of this gradient, i.e. different choices of $\Psi_t$:
- What we discussed so far uses $\Psi_t$