Context Extension by YaRN
LLM context length can be extended in the post-training process. These methods are mostly RoPE-based algorithms, such as YaRN (Yet Another RoPE extensioN).
0 RoPE Review
Found another good video on RoPE that shows the key idea: rotate the original embedding vector based on its absolute position in the sentence (determined only by the number of preceding words, i.e., the token index).
The advantage is that the relative positions of words are preserved regardless of the surrounding context.
For higher dimensions, we break the vector into 2-dim pairs and assign each pair a different $\theta$ to capture both high- and low-frequency features.
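The pairwise rotation and the relative-position property can be sketched in NumPy. This is an illustrative sketch, not any particular library's implementation; `rope_rotate` and the base value 10000 follow the common RoPE convention.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to one embedding vector x at position `pos`.

    Each 2-dim pair (x[2i], x[2i+1]) is rotated by angle pos * theta_i,
    where theta_i = base^(-2i/d): small i -> high frequency, large i -> low.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)           # per-pair frequency
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin          # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the dot product of a rotated query and key
# depends only on the position offset m - n, not on absolute positions.
q = np.random.default_rng(0).standard_normal(8)
k = np.random.default_rng(1).standard_normal(8)
a = rope_rotate(q, 5) @ rope_rotate(k, 2)       # offset 3 at (5, 2)
b = rope_rotate(q, 105) @ rope_rotate(k, 102)   # offset 3, shifted by 100
assert np.allclose(a, b)
```

The two asserted dot products agree because a 2-D rotation by $m\theta$ followed by the transpose of a rotation by $n\theta$ is a rotation by $(n-m)\theta$.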
1 Position Interpolation
Meta published Position Interpolation (PI) to extend the context window beyond the training length. The key idea is to interpolate RoPE positions directly. It is simple but loses information at high frequencies.
Let’s rewrite RoPE in complex number form
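In this view each 2-dim pair is treated as one complex number, and the rotation becomes multiplication by a unit phase. A sketch of the standard form (notation as in the RoPE/PI papers):

$$
f(\mathbf{x}, m) = \big[(x_1 + \mathrm{i}x_2)\,e^{\mathrm{i}m\theta_1},\; (x_3 + \mathrm{i}x_4)\,e^{\mathrm{i}m\theta_2},\; \dots,\; (x_{d-1} + \mathrm{i}x_d)\,e^{\mathrm{i}m\theta_{d/2}}\big],
\qquad \theta_j = 10000^{-2(j-1)/d}
$$

where $m$ is the token position and $d$ the embedding dimension.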
and PI is formulated as follows:
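PI linearly squeezes the new, longer position range back into the trained one. With $L$ the original context length and $L' > L$ the extended length, the PI paper defines

$$
f'(\mathbf{x}, m) = f\!\left(\mathbf{x}, \frac{mL}{L'}\right),
$$

so every position $m \in [0, L')$ is mapped into the trained range $[0, L)$. All frequencies are scaled by the same factor $L/L'$, which is why the high-frequency pairs suffer: their fine-grained positional distinctions get compressed the most.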
2 YaRN
YaRN builds on the NTK (Neural Tangent Kernel) idea, and this video shows an early paper explaining it.
The following notes are mainly from zhihu:
- NTK-aware method to address the high-frequency loss issue in PI
- NTK-by-parts
- Dynamic NTK
- YaRN
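The NTK-aware trick in the list above can be sketched as follows: instead of scaling positions (as PI does), scale the RoPE base so that high-frequency pairs are barely touched while low-frequency pairs end up interpolated roughly as in PI. This is a sketch of the commonly cited base-scaling formula, assuming scale factor $s = L'/L$; not a reference implementation.

```python
import numpy as np

def ntk_aware_theta(d, scale, base=10000.0):
    """NTK-aware RoPE frequencies: scale the base instead of the positions.

    base' = base * scale^(d / (d - 2)) spreads the interpolation unevenly:
    the highest-frequency pair (i = 0) is unchanged, while the
    lowest-frequency pair is slowed by exactly 1/scale, matching PI there.
    """
    i = np.arange(d // 2)
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-2.0 * i / d)

d, scale = 128, 4.0
orig = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
ntk = ntk_aware_theta(d, scale)
assert np.isclose(ntk[0], orig[0])            # i = 0: untouched
assert np.isclose(ntk[-1], orig[-1] / scale)  # last pair: PI-like 1/scale
```

The exponent $d/(d-2)$ is chosen precisely so that the last pair's frequency, $\text{base}'^{-(d-2)/d}$, equals the original frequency divided by the scale factor.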