Context Extension by YaRN


LLM context length can be extended during post-training. The common methods are all RoPE-based algorithms, such as YaRN (Yet another RoPE extensioN).

## 0 RoPE Review

Found another good video on RoPE that shows the key idea: rotate the original embedding vector by an angle determined by its absolute position in the sentence (each token's rotation depends only on how many words precede it). The advantage is that the relative positions of words are preserved no matter what the surrounding context is. For higher dimensions, we break the vector into 2-dim pairs and assign each pair a different $\theta$, so that different pairs capture high- and low-frequency positional features.
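As a concrete reference, here is a minimal NumPy sketch of that per-pair rotation (the function name and `base` default are illustrative, not taken from any particular library):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2-dim pair of x by an angle proportional to the
    absolute position pos. Standalone sketch, not a library API."""
    d = x.shape[-1]
    # one theta per 2-dim pair: theta_i = base^(-2i/d),
    # high frequencies first, low frequencies last
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-dim rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because both queries and keys are rotated this way before the dot product, $q_m^\top k_n$ ends up depending only on the offset $m - n$, which is exactly the relative-position property above.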

## 1 Position Interpolation

Meta published Position Interpolation (PI) to extend the context window beyond the training length. The key idea is to interpolate RoPE positions directly; it is simple, but it loses precision in the high-frequency dimensions. Writing RoPE in complex form, each 2-dim pair of the embedding $\mathbf{x}_m$ at position $m$ is encoded as $f(\mathbf{x}_m, m, \theta_d) = \mathbf{x}_m e^{i m \theta_d}$, and PI simply rescales every position back into the trained range: $f'(\mathbf{x}_m, m, \theta_d) = f\!\left(\mathbf{x}_m, \frac{mL}{L'}, \theta_d\right)$, where $L$ is the original context length and $L' > L$ is the extended one.
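In code, PI is just this position rescaling applied before the rotation; a sketch reusing `rope_rotate` from above, with illustrative context lengths:

```python
def rope_with_pi(x, pos, train_len=4096, target_len=16384, base=10000.0):
    """Position Interpolation: compress positions by L/L' so positions
    up to target_len map back onto the trained range [0, train_len).
    train_len/target_len here are illustrative values, not from the paper."""
    scale = train_len / target_len   # < 1: every position gets squeezed
    return rope_rotate(x, pos * scale, base=base)
```

The uniform squeeze is what hurts the high-frequency pairs: adjacent positions that used to be a full rotation step apart now differ by only a fraction of one, in every dimension at once.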

## 2 YaRN

YaRN builds on the NTK (Neural Tangent Kernel) idea, and this video walks through an early paper that explains it.

The following notes are mainly from a Zhihu post.

  1. "NTK-aware" interpolation deals with the high-frequency loss in PI: instead of rescaling positions, it enlarges the RoPE base so that low frequencies are interpolated while the highest frequencies stay almost unchanged.
  2. "NTK-by-parts" interpolates each dimension differently, depending on how many full wavelengths of that frequency fit into the trained context.
  3. "Dynamic NTK" updates the scale factor on the fly as the sequence grows, instead of fixing it ahead of time.
  4. YaRN combines NTK-by-parts interpolation with a temperature rescaling of the attention logits (see the sketch after this list).
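A minimal sketch of those pieces, based on the formulas in the YaRN paper; the function names are mine, and `alpha=1`, `beta=32`, and base 10000 are the LLaMA defaults suggested in the paper:

```python
import math
import numpy as np

def ntk_aware_base(base, s, d):
    """'NTK-aware' interpolation: keep positions as-is but enlarge the
    RoPE base, so low frequencies get interpolated while the highest
    frequencies are left almost unchanged. s = L'/L, d = head dim."""
    return base * s ** (d / (d - 2))

def ntk_by_parts_theta(d, s, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """'NTK-by-parts': blend per dimension between full interpolation
    (theta/s) and no interpolation (theta), based on how many full
    wavelengths fit into the trained context."""
    theta = base ** (-np.arange(0, d, 2) / d)
    r = train_len / (2 * np.pi / theta)   # wavelengths per trained context
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1 (high freq): keep theta; gamma = 0 (low freq): full PI
    return (1 - gamma) * theta / s + gamma * theta

def yarn_logit_scale(s):
    """YaRN additionally multiplies the attention logits by 1/t,
    where sqrt(1/t) = 0.1 * ln(s) + 1."""
    return (0.1 * math.log(s) + 1.0) ** 2
```

Dynamic NTK then amounts to recomputing `s = max(1, current_len / train_len)` at inference time and refreshing the scaled $\theta$ values as the sequence grows.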
