ML101 - 2


This post is mainly about optimizers, and it's worth reviewing them all. The overall problem to be solved: different parameters need different learning rates, so a single global learning rate is often a poor fit.
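As a rough sketch of the idea (notation mine, not taken from the original figures): vanilla gradient descent uses one global learning rate $\eta$ for every parameter, while the adaptive methods below divide it by a per-parameter statistic $\sigma$ built from past gradients:

$$\theta_{t+1}^{(i)} = \theta_t^{(i)} - \eta\, g_t^{(i)} \quad\longrightarrow\quad \theta_{t+1}^{(i)} = \theta_t^{(i)} - \frac{\eta}{\sigma_t^{(i)}}\, g_t^{(i)}$$

Each method below differs mainly in how it defines $\sigma_t^{(i)}$.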

1 AdaGrad

Adaptive learning rate: the learning rate is customized per parameter based on the magnitude of its past gradients, so parameters whose gradients have been consistently large take smaller steps.
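A minimal NumPy sketch of one AdaGrad step, assuming the standard update rule; the function name, state layout, and defaults (`lr`, `eps`) are my own illustration rather than a particular library's API:

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update: the per-parameter step shrinks as the
    running sum of squared gradients grows."""
    accum = accum + grad ** 2                      # accumulate squared gradients
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```

Because the accumulator only grows, the effective learning rate eventually decays toward zero, which is exactly the issue RMSProp addresses next.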

2 RMSProp

Interestingly, there is no formal paper for this method; it was proposed in Hinton's Coursera lecture. The key idea is to add exponential weighting on top of AdaGrad: replace the ever-growing sum of squared gradients with a decaying moving average, so recent gradients matter more than old ones.
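A minimal NumPy sketch of one RMSProp step, with the decay factor and defaults chosen by me for illustration:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update: exponentially weighted average of squared
    gradients instead of AdaGrad's unbounded sum."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq
```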

3 Adam

OK, this is the most popular optimizer, and it is simply the combination of the two previous ideas:
Adam = RMSProp + Momentum. It keeps a moving average of the gradients (momentum) and of the squared gradients (RMSProp), with bias correction for the first few steps.
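A minimal NumPy sketch of one Adam step, assuming the commonly used defaults (lr = 0.001, beta1 = 0.9, beta2 = 0.999); the function name and state-passing style are my own illustration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term m plus RMSProp-style second moment v,
    both bias-corrected (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```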

4 Learning rate scheduling: Warm up

Warm-up has been used in earlier papers such as the Residual Network and the Transformer: the learning rate starts small and ramps up over the first steps before following the usual schedule. The RAdam paper has more details on why this helps.
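A minimal sketch of a linear warm-up schedule; `base_lr` and `warmup_steps` are hypothetical names, and a real schedule would usually decay the rate after warm-up rather than holding it constant:

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warm-up: ramp the learning rate from ~0 to base_lr over
    the first warmup_steps updates, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```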

5 Summary

All the methods here focus on how to avoid getting stuck in local minima on a rugged, grooved error surface. The next chapter will focus on how to smooth the surface itself.
