DeepSeek R1 - Training
Continuing with DeepSeek R1 from EZ Encoder. Link
0 Overview
R1 overview: the great achievement of R1 is making many previously proposed ideas actually work in LLMs.
1 RL without SFT
Teaching Large Language Models to Reason with Reinforcement Learning evaluates different RL methods for reasoning models, including RL without a reward model and RL without SFT initialization.
The results were reasonably good, but nothing compared to R1, largely because the model is small (13B). (Sparse here means a result-only reward; dense means a process reward model.)
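To make the sparse vs. dense distinction concrete, here is a minimal sketch, assuming a result-only reward that checks the final answer versus a process reward that scores every intermediate step; the function names and the "####" answer delimiter are illustrative assumptions, not from the paper.

```python
# Minimal sketch: sparse (result-only) vs. dense (process) rewards.
# All names and the "####" answer delimiter are illustrative assumptions.

def sparse_outcome_reward(solution_text: str, gold_answer: str) -> float:
    """Result-only reward: one scalar for the whole rollout, given at the end."""
    predicted = solution_text.split("####")[-1].strip()  # assume answer follows "####"
    return 1.0 if predicted == gold_answer.strip() else 0.0

def dense_process_reward(steps: list[str], step_scorer) -> list[float]:
    """Process reward: one score per intermediate reasoning step.

    `step_scorer(step) -> float` stands in for a learned process reward model (PRM).
    """
    return [step_scorer(step) for step in steps]
```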
2 Training paradigm
The training paradigm is similar to Llama 3's: DeepSeek V3 serves as the reward model / judge to clean reasoning data via rejection sampling.
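A minimal sketch of rejection sampling as described above: sample several candidates per prompt, score them with a judge (DeepSeek V3 in this setting), and keep only responses that pass. `generate` and `judge` are placeholder callables, not real APIs, and the threshold is an assumption.

```python
def rejection_sample(prompts, generate, judge, n_candidates=8, threshold=0.5):
    """Keep only (prompt, response) pairs whose judge score passes a threshold.

    generate(prompt, n) -> list[str]   # policy model samples n candidate responses
    judge(prompt, response) -> float   # e.g. DeepSeek V3 acting as a grader
    """
    kept = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        best_score, best_response = max((judge(prompt, c), c) for c in candidates)
        if best_score >= threshold:
            kept.append({"prompt": prompt, "response": best_response})
    return kept
```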
3 Distillation
The "Large Language Models Can Self-Improve" paper also compared distilling the improved model into a smaller one.
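Distillation in this context means supervised fine-tuning the smaller student on reasoning traces written by the stronger teacher, rather than matching logits. Below is a minimal sketch of the masked SFT loss, assuming PyTorch conventions (the -100 ignore index, the shapes); it is not code from either paper.

```python
import torch.nn.functional as F

def sft_distillation_loss(student_logits, token_ids, prompt_len):
    """Next-token cross-entropy, masked so only the teacher-written completion
    (not the prompt) contributes to the loss.

    student_logits: (seq_len, vocab_size) logits from the student model
    token_ids:      (seq_len,) LongTensor of prompt + teacher-completion tokens
    prompt_len:     number of prompt tokens to mask out
    """
    logits = student_logits[:-1]        # position t predicts token t+1
    labels = token_ids[1:].clone()
    labels[: prompt_len - 1] = -100     # ignore predictions of prompt tokens
    return F.cross_entropy(logits, labels, ignore_index=-100)
```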
4 R1 Training
The reward in R1 is rule-based, consisting of an accuracy reward and a format reward.
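The paper does not release the reward code, so the following is only a sketch of what an accuracy-plus-format rule-based reward could look like. The <think>/<answer> tags follow the template described in the paper, but the regexes and weights are assumptions.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy rule-based reward = format term + accuracy term (weights are made up)."""
    # Format reward: reasoning wrapped in <think>...</think>, answer in <answer>...</answer>.
    format_ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                               response, flags=re.DOTALL))
    format_reward = 0.5 if format_ok else 0.0

    # Accuracy reward: extracted final answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if predicted == gold_answer.strip() else 0.0

    return format_reward + accuracy_reward
```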
There are two rounds of training for R1. Round 1 focuses on reasoning; round 2 adds general capability: the model trained in round 1 generates reasoning data, which is filtered by rejection sampling with V3, giving about 800K training samples in total.
QwQ-32B-Preview was used in the paper as a baseline to compare the RL and distillation methods.
Two failed attempts
- PRM (process reward model) may lead to reward hacking.
- MCTS is not practical here because the token-level search space is too large (a back-of-the-envelope calculation follows this list).
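An illustration of the search-space problem: if MCTS branches over next-token choices, the tree grows as vocab_size^depth. The numbers below are illustrative, not from the paper.

```python
# Illustrative only: token-level branching makes the search tree astronomically large.
vocab_size = 32_000   # assumed typical LLM vocabulary size
depth = 10            # expanding just 10 tokens ahead
print(f"{vocab_size ** depth:.2e} possible continuations")  # ~1.13e+45
```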
5 Follow-up works
- s1: Simple test-time scaling argues that 600K samples is far more than needed and that about 1K is enough; its s1K dataset yields reasonably good results.
It also replaces “</think>” with “Wait” to force a longer CoT, a technique called budget forcing, which the paper refers to as sequential scaling; the parallel-scaling counterpart is majority voting (see the sketch after this list).
- LIMO: Less is More for Reasoning post-trains the model with only 871 samples, used as “cognitive templates”.
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs penalizes the LLM for generating “Alternatively”, a token that typically starts a new thought, because switching thoughts too early leaves each thought too short to reach the correct answer.
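A hedged sketch of s1-style budget forcing (sequential scaling) next to majority voting (parallel scaling). `generate_until`, the delimiter string, and the extension count are placeholders, not the actual s1 implementation.

```python
from collections import Counter

END_THINK = "</think>"  # end-of-thinking delimiter (placeholder)

def budget_forcing(prompt, generate_until, num_extensions=2):
    """Sequential scaling: suppress the end-of-thinking delimiter and append "Wait"
    a few times, forcing a longer chain of thought before the model may answer.

    generate_until(text, stop) -> str  # placeholder decode call; returns new text
                                       # generated up to (but excluding) `stop`
    """
    text = prompt + "<think>"
    for _ in range(num_extensions):
        text += generate_until(text, stop=END_THINK)
        text += "Wait"                  # keep thinking instead of closing the block
    text += generate_until(text, stop=END_THINK) + END_THINK
    return text

def majority_vote(final_answers):
    """Parallel scaling: sample several full solutions independently and
    return the most common final answer."""
    return Counter(final_answers).most_common(1)[0][0]
```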