DeepSeek V3 - MLA

Let’s summarize what I’ve learned about DeepSeek V3 over the recent weeks.

0 Testing datasets

  • MMLU-Pro: University of Waterloo’s enhanced Multi-Task Language Understanding benchmark, released in 2024. ~12,000 questions.
  • GPQA: Graduate-Level Google-Proof Q&A benchmark. 448 questions.
  • MATH-500: OpenAI’s 500-problem subset of the MATH benchmark.
  • Codeforces: competitive programming website, used as a coding benchmark.
  • SWE-Bench: coding benchmark built from real GitHub issues (the Verified subset was curated by OpenAI).

1 Multi-Head Latent Attention

MLA was introduced in the DeepSeek V2 paper; it is a low-rank decomposition of the KV cache. Attention can be computed directly from the latent vector $c$, without restoring it back to K/V. That’s because the up-projection matrix $W^{UK}$ can be absorbed into the query weight matrix $W^{Q}$. The query is also low-rank decomposed to save memory.
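
Here is a minimal single-head sketch of that absorption trick in NumPy (the dimensions and weight names such as `W_q`, `W_dk`, `W_uk` are hypothetical, and softmax/values are omitted): scoring against the cached latent directly gives the same result as first restoring the keys.

```python
import numpy as np

# Minimal single-head MLA sketch (hypothetical dimensions, no softmax/values).
d_model, d_latent = 64, 16          # hidden size, KV latent size
rng = np.random.default_rng(0)

W_q  = rng.standard_normal((d_model, d_model))   # query projection
W_dk = rng.standard_normal((d_model, d_latent))  # down-projection: h -> latent c
W_uk = rng.standard_normal((d_latent, d_model))  # up-projection: c -> key

h_q = rng.standard_normal(d_model)        # current token's hidden state
H   = rng.standard_normal((10, d_model))  # prefix tokens' hidden states

C = H @ W_dk                  # cached latents (10 x d_latent) -- this is the whole KV cache

# Naive path: restore keys from the latent, then dot with the query.
q = h_q @ W_q
K = C @ W_uk                  # up-project back to full keys
scores_naive = K @ q

# Absorbed path: fold W_uk into the query projection, score against the latent directly.
q_absorbed = h_q @ W_q @ W_uk.T
scores_absorbed = C @ q_absorbed

assert np.allclose(scores_naive, scores_absorbed)
```

The point is that only the small latent matrix `C` needs to be cached, and the merged matrix `W_q @ W_uk.T` can be formed once per layer instead of materializing full keys per token.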

3 Rotary Position Embedding

There are two major families of position embedding (PE): absolute and relative. Absolute PE is simple to implement, but if training only covers short contexts the model cannot handle longer positions, so it extrapolates poorly. RoPE combines the strengths of absolute and relative PE. In the figure, the left panel shows the vectors without PE, and the middle panel shows adding absolute PE, which can change a vector’s length; RoPE is designed to preserve vector length and is applied only to Q and K. Mathematically, $R_q$ and $R_k$ are the rotation matrices for Q and K: each encodes an absolute position, but when they meet in the attention dot product they combine into a single $R$ that depends only on the relative position.
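
The sketch below (my own minimal NumPy implementation of a standard RoPE variant, not DeepSeek’s exact code) checks both properties: the rotation preserves vector length, and the Q·K score depends only on the relative offset between the two positions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to vector x at absolute position `pos` (pairs of dims are rotated)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]                      # each (x1[i], x2[i]) pair is rotated
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Rotation preserves vector length.
assert np.isclose(np.linalg.norm(rope_rotate(q, 5)), np.linalg.norm(q))

# The score depends only on the relative offset: (3, 7) and (10, 14) give the same value.
s1 = rope_rotate(q, 3) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 10) @ rope_rotate(k, 14)
assert np.isclose(s1, s2)
```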

But since RoPE injects position information into the keys, the latent-only cache no longer works in the plain MLA scheme, and RoPE has to be decoupled. From the formula you can see that $W^{UK}$ can no longer be absorbed into $W^{Q}$, because the position-dependent rotation $R$ sits between them.
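
A short derivation of the clash, in notation following the figures above ($h_i$ is the query token’s hidden state, $c_t$ the cached latent of a prefix token; the position indices $i, t$ are mine):

$$
q_i^\top k_t
= \left(R_i W^{Q} h_i\right)^\top \left(R_t W^{UK} c_t\right)
= h_i^\top W^{Q\top} \underbrace{R_i^\top R_t}_{=\,R_{t-i}} W^{UK} c_t .
$$

Without RoPE the middle factor is the constant matrix $W^{Q\top} W^{UK}$, which can be precomputed once (the absorption). With RoPE it becomes $W^{Q\top} R_{t-i} W^{UK}$, which changes with the relative position $t-i$, so no single merged matrix exists.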

Without decoupling, we would have to recompute the keys for all prefix tokens during inference. The fix is a separate RoPE branch: a small rotary key is computed once per token, cached alongside the latent, and concatenated with the content part at attention time. That’s why the RoPE branch shows up in the following architecture diagram.
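
Below is a hypothetical single-head sketch of that decoupled branch (all names and sizes, e.g. `W_q_r`, `W_k_r`, `d_rope`, are mine, not the paper’s): the cache holds the position-free latent plus a small rotary key per token, and concatenating the two parts before the dot product is the same as summing the content score and the rotary score.

```python
import numpy as np

d_model, d_latent, d_rope, d_head = 64, 16, 8, 32
rng = np.random.default_rng(0)

W_q_c = rng.standard_normal((d_model, d_head))   # content query (no position info)
W_q_r = rng.standard_normal((d_model, d_rope))   # rotary query branch
W_dk  = rng.standard_normal((d_model, d_latent)) # shared KV down-projection
W_uk  = rng.standard_normal((d_latent, d_head))  # content key up-projection
W_k_r = rng.standard_normal((d_model, d_rope))   # rotary key branch

def rope(x, pos, base=10000.0):
    half = x.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

H = rng.standard_normal((10, d_model))           # prefix hidden states
cache_c  = H @ W_dk                              # cached latents (position-free)
cache_kr = np.stack([rope(h @ W_k_r, t) for t, h in enumerate(H)])  # cached small rotary keys

h_q, pos_q = rng.standard_normal(d_model), 10    # current token
q = np.concatenate([h_q @ W_q_c @ W_uk.T,        # content part, scored against the latent
                    rope(h_q @ W_q_r, pos_q)])   # rotary part, scored against cached k^R
K = np.concatenate([cache_c, cache_kr], axis=-1) # per prefix token: [latent | rotary key]
scores = K @ q                                   # = content score + rotary score, per token
```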

Another discussion on Hugging Face showed that summing the RoPE branch into the content part works better than concatenating them.
