Deepseek R1 - GPT history

1 minute read

Continuing with DeepSeek R1 from the EZ Encoder series (link)

0 Kimi K1.5

Flood Sung explained on Zhihu how Kimi K1.5 was trained, and how the team learned from OpenAI o1 to get long chain-of-thought reasoning.

Both Noam Brown and Richard Sutton emphasize search rather than structured methods like MCTS, and warn against being limited by the reward model in RL because of reward hacking.


1 Inductive Bias

Both RNNs and CNNs have structural inductive biases built in, while the Transformer does not: it uses only attention and MLPs.
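To make that concrete, here is a minimal sketch of a Transformer block in PyTorch (my own illustration, not from the post; sizes are arbitrary): the whole block is just self-attention plus a position-wise MLP, with no recurrence or convolution, so any notion of token order has to be injected separately via positional information.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A pre-norm Transformer block: only attention and an MLP."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Self-attention: every token can attend to every other token,
        # with no built-in ordering or locality assumption.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise MLP, applied to each token independently.
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 16, 512)          # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)   # torch.Size([2, 16, 512])
```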

2 The Bitter Lesson

Richard Sutton argued for general methods that scale easily with computation as the way to improve performance.

3 History

BERT follows the encoder side of the traditional encoder-decoder Transformer.

The GPT-1 paper, Improving Language Understanding by Generative Pre-Training: Alec Radford made the historic move of using only the Transformer's decoder.

The GPT-2 paper, Language Models are Unsupervised Multitask Learners: all tasks are framed as QA-style text completion, and the scaling law starts to show up.

The GPT-3 paper, Language Models are Few-Shot Learners: even zero-shot performance improves with scale, following the scaling law.

Google released its decoder-only model, PaLM, at a larger size. Similarly, DeepSeek also increased model sizes. Meta's OPT-175B had many failures.
The InstructGPT paper, Training Language Models to Follow Instructions with Human Feedback, critically introduces RL into training: first SFT, then train a reward model, then RL via PPO.
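As a concrete example, the reward-model stage is trained with a pairwise comparison loss: the reward of the human-preferred response should exceed that of the rejected one. A minimal sketch (PyTorch assumed; the tensors are toy data, not a real API):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Pairwise loss: push r(x, y_chosen) above r(x, y_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scalar rewards the reward model assigned to chosen vs. rejected responses.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.7, 1.1])
print(reward_model_loss(chosen, rejected))  # prints a scalar loss
```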

Reward hacking shows up when the policy over-optimizes against the reward model. This is summarized in the OpenAI paper Learning to Summarize from Human Feedback.
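A common mitigation in that line of work is to penalize the policy for drifting too far from the supervised model, subtracting a per-sample KL term from the learned reward (my paraphrase of the objective, with $r_\phi$ the reward model, $\pi^{\mathrm{RL}}$ the policy being optimized, and $\pi^{\mathrm{SFT}}$ the supervised baseline):

$$
R(x, y) = r_\phi(x, y) - \beta \log \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
$$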

This is the overall paradigm of LLM training from OpenAI.

4 DPO

DPO (Direct Preference Optimization: Your Language Model Is Secretly a Reward Model) lets you skip reward-model training and step outside the RL framework entirely; it was used by Llama 3.
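For reference, the DPO loss works directly on preference pairs $(x, y_w, y_l)$, with $y_w$ preferred over $y_l$, and measures the policy against a frozen reference model instead of a separately trained reward model (paraphrased from the paper):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) =
-\,\mathbb{E}_{(x, y_w, y_l)}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$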

5 Emergence

Google published Emergent Abilities of Large Language Models in 2022.
