Context Extension by YaRN


LLM context length can be extended in post-training. The methods discussed here are all RoPE-based algorithms, such as YaRN (Yet Another RoPE extensioN).

0 RoPE Review

Found another good video on RoPE that shows the key idea: rotate the original embedding vector based on its absolute position in the sentence (only based on preceding words). The advantage is that the relative positions of words are preserved no matter what the surrounding context is. For instance, take “I walk my dog”: adding a prefix or suffix to this sentence won’t change the relative position of “I” and “dog”, which will always be $3\theta$. For higher dimensions, we break the vector down into 2-dim pairs and assign each pair a different $\theta$ to capture high- and low-frequency features.
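To make this concrete, here is a minimal NumPy sketch of RoPE on 2-dim pairs. It is illustrative only: the embedding size, the base of 10000, and the helper name are my own choices, not taken from any particular implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2-dim pair of x by pos * theta_i.

    x: (d,) vector with d even; pair i uses theta_i = base^(-2i/d),
    so small i = high frequency, large i = low frequency.
    """
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # per-pair frequencies
    angle = pos * theta                              # absolute position sets the angle
    x1, x2 = x[0::2], x[1::2]                        # split into 2-dim pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

# Relative position is what survives the attention dot product:
# rotating q at position m and k at position n leaves q.k depending
# only on (m - n), so prepending text shifts both and the score is unchanged.
q, k = np.random.randn(8), np.random.randn(8)
s1 = rope_rotate(q, 4) @ rope_rotate(k, 1)     # "dog" at 4, "I" at 1: offset 3
s2 = rope_rotate(q, 11) @ rope_rotate(k, 8)    # same sentence after a 7-token prefix
print(np.isclose(s1, s2))                      # True (up to float error)
```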

1 Position Interpolation

Meta published PI (Position Interpolation) to extend the context window beyond the training length. The key idea is to interpolate RoPE positions directly. It is simple, but it loses too much high-frequency information. Rewriting RoPE in complex-number form makes the PI formulation clear, as sketched below.
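In complex form RoPE attaches position $m$ as $f(x, m) = x\, e^{i m \theta}$, and PI keeps the same rotation but rescales the position into the trained range, $f'(x, m) = f\!\left(x,\ m \cdot \tfrac{L_{\text{train}}}{L_{\text{target}}}\right)$. A minimal sketch on top of the `rope_rotate` helper above (the 2048/8192 lengths are hypothetical examples):

```python
def rope_with_pi(x, pos, train_len=2048, target_len=8192, base=10000.0):
    """PI rescales the position before applying RoPE:
    pos' = pos * (train_len / target_len).
    Every frequency is interpolated uniformly, which is why
    high-frequency detail gets squeezed."""
    scale = train_len / target_len            # s = L_train / L_target < 1
    return rope_rotate(x, pos * scale, base)

# A token at position 6000 (beyond the 2048 training window) is mapped
# to an effective position of 1500, which the model has already seen.
print(6000 * (2048 / 8192))  # 1500.0
```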

2 YaRN

Yet Another RoPE extensioN is based on the NTK (Neural Tangent Kernel), and this video covers an early paper explaining the idea.

The following notes are mainly from zhihu and medium.

  1. The NTK-aware method deals with the high-frequency loss issue in PI. Instead of uniform interpolation, it spreads the interpolation pressure across dimensions by scaling high frequencies less and low frequencies more.
  2. NTK-by-parts goes further: it does not interpolate the higher-frequency dimensions at all and always interpolates the lower-frequency dimensions.
  3. NTK-Dynamic adjusts the scale factor on the fly as the current sequence length grows, instead of committing to one fixed factor.
  4. YaRN = NTK-by-parts + logit temperature; a sketch of these pieces follows after this list.
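The sketch below follows the formulas from the YaRN paper as I understand them; the defaults $\alpha=1$, $\beta=32$ are the paper's recommended values, but the function and variable names are mine and the whole thing is an assumption-laden illustration rather than a reference implementation.

```python
import numpy as np

def ntk_aware_base(base, s, d):
    """NTK-aware scaling: instead of shrinking every angle by 1/s, enlarge the
    RoPE base so high-frequency dims are barely touched and low-frequency dims
    absorb most of the interpolation: b' = b * s^(d / (d - 2))."""
    return base * s ** (d / (d - 2))

def ramp(r, alpha=1.0, beta=32.0):
    """NTK-by-parts ramp gamma(r): r counts how many full rotations a dimension
    completes over the original context. gamma=1 -> keep the frequency
    (no interpolation), gamma=0 -> interpolate it fully, linear in between."""
    return np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)

def yarn_frequencies(d, s, train_len=2048, base=10000.0):
    """YaRN per-dimension frequencies: blend theta (no interpolation) and
    theta / s (full interpolation) with the ramp."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    rotations = train_len * theta / (2 * np.pi)    # r_i = L / lambda_i
    gamma = ramp(rotations)
    return gamma * theta + (1.0 - gamma) * theta / s

def yarn_attn_scale(s):
    """Logit temperature: YaRN additionally scales attention logits by 1/t,
    with sqrt(1/t) = 0.1 * ln(s) + 1, to keep attention entropy close to
    its pre-extension value."""
    return (0.1 * np.log(s) + 1.0) ** 2   # returns 1/t

freqs = yarn_frequencies(d=128, s=4.0)    # e.g. a 2k -> 8k extension
print(freqs[:3], yarn_attn_scale(4.0))
```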
