Reinforcement Learning - Policy Gradient


Learning RL was actually my very first ML project since I joined AWS. DeepRacer was released at re:Invent 2018, and our prototyping team got our hands on it in early 2019. The task was straightforward: we were going to run a DeepRacer hackathon with one of our largest manufacturing customers, and MS would be there as well. So the demo and content needed to be polished and impressive.

So, here we go: Q-learning, the Bellman equation, DQN, PPO, and… oh, it was also the first time I came across a library called Ray, whose RLlib is used in the DeepRacer training source code.

Here are some notes I took while reviewing RL through Dr. Hung-yi Lee's course:

  1. A trajectory $\tau$ is a sequence of states and actions, $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$. The probability of a trajectory under a policy parameterized by $\theta$ can be calculated as
    $$p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
  2. The expected reward is the accumulated reward of each trajectory, weighted by the probability of that trajectory:
    $$\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$$
  3. The policy gradient is the gradient of the expected reward. Using the math trick $\nabla p_\theta(\tau) = p_\theta(\tau)\,\nabla \log p_\theta(\tau)$ and a rough estimate of the expectation by sampling $N$ trajectories:
    $$\nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\,\nabla \log p_\theta(\tau)\big] \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$$
  4. In implementation, you collect states and actions by rolling out the current policy to sample trajectories, then update the parameters by gradient ascent:
    $$\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$$
  5. You can view it as a classification problem: minimizing cross-entropy is essentially maximizing likelihood, so in policy gradient we simply weight each log-likelihood term by the total reward of the whole episode.
  6. Tip 1: subtract a baseline $b$ (e.g. $b \approx \mathbb{E}[R(\tau)]$) so the reward weight is not always positive; otherwise every sampled action gets its probability pushed up, and actions that happen not to be sampled only ever lose probability after normalization, even if they are good:
    $$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$$
  7. Tip 2: when weighting the action taken at time $t$, only count rewards from time $t$ onward (earlier rewards are not caused by that action) and add a discount factor $\gamma$, so the weight becomes $\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b$. This weight generalizes to the advantage function $A^\theta(s_t, a_t)$, which estimates how much better taking $a_t$ in $s_t$ is than average. (A minimal code sketch of these notes follows the list.)
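
To make the notes concrete, here is a minimal REINFORCE-style sketch in PyTorch covering items 3–7: log-likelihood weighted by reward, a baseline, and discounted reward-to-go. It assumes a discrete action space, and the names `PolicyNet`, `reward_to_go`, and `pg_loss` are my own illustration, not from the course slides or the DeepRacer source.

```python
# Minimal policy-gradient sketch (assumptions: discrete actions, tiny MLP policy;
# all names here are illustrative, not from the DeepRacer/RLlib code).
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """pi_theta(a|s): maps a state vector to a distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, states: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(states))


def reward_to_go(rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """Tip 2: credit action a_t only with discounted rewards from time t onward."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))


def pg_loss(policy, states, actions, rewards, baseline: float = 0.0):
    """Negative of the sampled objective: mean_t [ log pi_theta(a_t|s_t) * (G_t - b) ]."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)             # log pi_theta(a_t | s_t)
    weights = reward_to_go(rewards) - baseline     # Tip 1: subtract a baseline b
    return -(log_probs * weights).mean()           # minimizing this ascends R_bar


if __name__ == "__main__":
    # Fake one sampled episode just to show the update step.
    policy = PolicyNet(state_dim=4, n_actions=2)
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    states = torch.randn(5, 4)                     # 5 steps of 4-dim states
    actions = policy(states).sample()              # actions sampled from pi_theta
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0]            # rewards r_t from the environment

    loss = pg_loss(policy, states, actions, rewards, baseline=0.6)
    opt.zero_grad()
    loss.backward()
    opt.step()                                     # theta <- theta + eta * grad R_bar
```

In practice the baseline would be a learned value function and the trajectories would come from an environment rollout loop (which is exactly the plumbing RLlib provides), but the loss is still the same reward-weighted likelihood described in note 5.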
