Llama2 tricks (2)

This is a summary of the explanation of the KV cache and RoPE in this video. I really like how Bai explained RoPE.

## 1 KV Cache

First, review the meaning of Q/K/V in self-attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

So each token corresponds to a column in $K^\top$ (i.e., a row in $K$) and a row in $V$, and these entries do not change once computed. That is why we can cache the previous results: at every step we only need to append the keys and values of the new token. The size of the KV cache can be estimated with the following formula:

$$\text{KV cache size} \approx 2 \times b \times n_{\text{layers}} \times d_{\text{model}} \times L \times B$$

where the factor 2 counts both K and V, $b$ is the bytes per value (2 for fp16), $L$ is the sequence length, and $B$ is the batch size.
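
To make the caching concrete, here is a minimal NumPy sketch of single-head decoding with a KV cache. The shapes and the $\sqrt{d_k}$ scaling follow the attention formula above; the dimension `d = 64` and the weight initialization are illustrative, not from the video.

```python
import numpy as np

d = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

K_cache = np.empty((0, d))  # one row per past token
V_cache = np.empty((0, d))

def decode_step(x):
    """x: embedding of the NEW token only, shape (d,)."""
    global K_cache, V_cache
    q = x @ Wq                                  # query for the new token
    # Append this token's key/value; earlier rows never change,
    # which is exactly why they can be cached.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    attn = softmax(q @ K_cache.T / np.sqrt(d))  # weights over all tokens so far
    return attn @ V_cache                       # attention output for the new token

for _ in range(5):                  # five decode steps; cache grows one row per step
    out = decode_step(rng.standard_normal(d))
print(K_cache.shape, V_cache.shape)  # (5, 64) (5, 64)
```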

For a 30B model, the KV cache can grow to roughly 180 GB, which is much larger than the model weights themselves.
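
As a sanity check on that number, here is the estimate in code. The configuration below (48 layers, d_model = 7168, fp16, batch 128, context 1024) is my own assumption for a plausible 30B-class model, not a figure from the video:

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch, bytes_per_val=2):
    # 2 tensors (K and V) per layer, each seq_len x d_model, per sequence
    return 2 * bytes_per_val * n_layers * d_model * seq_len * batch

# Hypothetical 30B-class config: 48 layers, d_model = 7168, fp16
size = kv_cache_bytes(n_layers=48, d_model=7168, seq_len=1024, batch=128)
print(f"{size / 1e9:.0f} GB")  # ~180 GB
```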

TTFT (time to first token) is slower than ITL (inter-token latency) because no KV cache is available for the first token: the prefill step must compute keys and values for the entire prompt before anything can be generated. I missed this question during an interview…

## 2 RoPE

1. **Absolute Positional Embedding.** The issues are that (1) positions are bounded, and (2) there is no notion of relative position: Pos2 and Pos500 are equally "different" from Pos1.
2. **Relative Positional Embedding.** Uses the relative positions between words. But it is slow because it adds an extra step to self-attention, and the relative offsets change at every decoding step, so we can NOT use the KV cache.
3. **Rotary Positional Embedding (RoPE)**
   - Uses ROTATION to represent the position of a word in the sentence.
   - A token's embedding does not change when new tokens are added after it (so we can use the KV cache!).
   - The relative positions of the words are preserved: the dot product of two rotated vectors depends only on their offset.
   - For 2-d vectors the implementation is a rotation matrix, $\begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$; for higher-dimensional vectors, treat them as groups of 2-d pairs, each rotated at its own frequency.
   - The actual implementation simplifies to elementwise multiplies with precomputed sines and cosines, with no matrix multiplication; see the sketch after this list.
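
A minimal NumPy sketch of that simplified form. I assume the usual frequencies $\theta_i = 10000^{-2i/d}$ and the even/odd interleaving of pairs; both are common conventions, not necessarily the exact layout shown in the video.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even dim d) for position pos.
    Dims are grouped into 2-d pairs (x[2i], x[2i+1]); pair i is
    rotated by angle pos * theta_i with theta_i = base**(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]        # the 2-d pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # elementwise, no matrix multiply
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position check: the dot product of two rotated vectors
# depends only on the offset, not on the absolute positions.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope(q, 3) @ rope(k, 7)      # offset 4, positions 3 and 7
b = rope(q, 103) @ rope(k, 107)  # offset 4, positions 103 and 107
print(np.isclose(a, b))          # True
```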

## 3 Beam Search

- Greedy search (no beam): at each step, keep only the single token with the largest probability.
- Beam search keeps the top-K candidate sequences at each step, ranked by cumulative probability; see the sketch below.
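
A minimal beam-search sketch over a toy next-token distribution. The `next_logprobs` table standing in for the model is hypothetical, just enough to make the loop runnable:

```python
import numpy as np

def beam_search(next_logprobs, vocab_size, k=3, steps=5, bos=0):
    """Keep the top-k sequences by cumulative log-probability at each step.
    Greedy search is the special case k = 1."""
    beams = [([bos], 0.0)]                  # (token list, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            lp = next_logprobs(seq)         # (vocab_size,) log-probs of next token
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + lp[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]              # prune to the k best partial sequences
    return beams

# Toy "model": a fixed random table of next-token logits keyed by the last token.
rng = np.random.default_rng(0)
T = rng.standard_normal((10, 10))

def next_logprobs(seq):
    logits = T[seq[-1]]
    return logits - np.log(np.exp(logits).sum())  # normalize to log-probs

for seq, score in beam_search(next_logprobs, vocab_size=10):
    print(seq, round(float(score), 2))
```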
