RL in 2025

Happy $\pi$ Day! It's time to review RL in 2025. This Zhihu post gives me a much clearer picture of value-based and policy-based methods. I guess a yearly review of RL also improves my own understanding of it.

1 $\pi$ vs $V_\pi$

RL can be summarized as two alternating steps:

  1. Policy evaluation: given a policy $\pi$, how do we correctly evaluate its value function $V_\pi$?
  2. Policy improvement: given the current value function $V_\pi$, how do we improve the policy $\pi$?

We perform these two steps in turn until convergence, which yields the optimal policy $\pi^*$ and the optimal value function $V^*$.
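As a concrete illustration, here is a minimal tabular policy-iteration sketch, assuming the MDP's transition tensor `P` and expected-reward table `R` are known; all names here are my own, not from the post:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration on a known MDP.

    P: transition probabilities, shape (S, A, S)
    R: expected rewards,         shape (S, A)
    """
    S, A = R.shape
    pi = np.zeros(S, dtype=int)            # start from an arbitrary policy
    while True:
        # 1. Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V
        P_pi = P[np.arange(S), pi]         # (S, S) transitions under pi
        R_pi = R[np.arange(S), pi]         # (S,)  rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # 2. Policy improvement: act greedily with respect to Q(s, a)
        Q = R + gamma * P @ V              # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):     # no change -> pi is optimal
            return pi, V
        pi = new_pi
```

Step 1 solves the Bellman equation for $V_\pi$ exactly, and step 2 is the greedy improvement; the loop is exactly the alternation described above.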

Hence we can define value-based and policy-based methods accordingly, or actor-critic methods, which use both together.

2 Policy-based RL

Policy-based RL has the following objective:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} r_t\right]$$

Going through the derivation and dropping the terms whose gradient with respect to the policy is zero (the terms that depend only on $s$), we get the following gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
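As a sketch, here is one way to estimate this gradient by Monte Carlo for a linear-softmax policy; every name here (and the generic `psi` weights) is my own illustration under that assumption:

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a linear-softmax policy.

    theta: (A, D) weight matrix, s: (D,) state features, a: action index.
    """
    logits = theta @ s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    g = -np.outer(probs, s)               # -pi(a'|s) * s for every row a'
    g[a] += s                             # +s for the action actually taken
    return g

def policy_gradient(theta, states, actions, psi):
    """Monte-Carlo estimate of E[sum_t Psi_t * grad log pi(a_t|s_t)]."""
    g = np.zeros_like(theta)
    for s, a, w in zip(states, actions, psi):
        g += w * grad_log_pi(theta, s, a)
    return g
```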

Here are the different variants of this gradient, i.e., different choices of $\Psi_t$: the total trajectory reward $\sum_{t=0}^{\infty} r_t$, the reward following $a_t$, $\sum_{t'=t}^{\infty} r_{t'}$, its baselined version $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$, the state-action value $Q^\pi(s_t, a_t)$, the advantage $A^\pi(s_t, a_t)$, and the TD residual $r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$.

  1. What we discussed so far is using $\Psi_t = \sum_{t'=t}^{\infty} r_{t'}$, the reward following action $a_t$; see the sketch after this list for how a few of these choices are computed.
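Below is a sketch of three of these $\Psi_t$ choices, computed from one finished episode; the resulting arrays can be fed as `psi` into the `policy_gradient` sketch above. Names are illustrative, and the TD residual assumes a value estimate `values` is available:

```python
import numpy as np

def psi_total_return(rewards, gamma=0.99):
    """Psi_t = sum_t gamma^t r_t: the whole trajectory return at every step."""
    G = sum(gamma**t * r for t, r in enumerate(rewards))
    return np.full(len(rewards), G)

def psi_reward_to_go(rewards, gamma=0.99):
    """Psi_t = sum_{t'>=t} gamma^(t'-t) r_{t'}: reward following action a_t."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def psi_td_residual(rewards, values, gamma=0.99):
    """Psi_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V past the last step = 0."""
    v = np.asarray(values, dtype=float)
    v_next = np.append(v[1:], 0.0)
    return np.asarray(rewards, dtype=float) + gamma * v_next - v
```

Swapping among these changes the variance of the gradient estimate while keeping it (approximately) unbiased, which is exactly why the variants matter.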
