Llama2 tricks


A good YouTube video explains several tricks applied in Llama2.
Here are my study notes.

1. Layer normalization

Batch norm: normalizes by columns (same feature, different data points)
Layer norm: normalizes by rows (same data point, different features)
Normalization helps with internal covariate shift: a drastic change in one layer's output leads to a drastic change in the input to the next layer.
Rescaling turns out to matter more than re-centering, which leads to RMSNorm: drop the mean subtraction and normalize only by the root mean square.
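Here is a minimal PyTorch sketch of the idea (the class name and the eps value are my own choices, not taken from the Llama2 code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root mean square, no re-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, like LayerNorm's gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm there is no mean subtraction: only the scale is normalized.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```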

2. RoPE

RoPE (Rotary Position Embedding) is a form of relative position encoding.
Absolute encodings are added directly to the token embeddings.
Relative encodings are instead injected into the query-key interaction, making the attention score a ménage à trois of query, key, and relative offset.
RoPE was introduced in the RoFormer paper from Zhuiyi (追一), a Chinese company.
While looking for a formulation whose query-key dot product depends only on the relative position, the authors ended up with a rotation, which is how the method gets its name.
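A rough PyTorch sketch of the rotation (pairing the first half of the channels with the second half is just one common convention; the function name and shapes are my own, not the Llama2 code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per channel pair: theta_i = base^(-2i/dim)
    freqs = torch.pow(base, -2.0 * torch.arange(half, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # A 2D rotation per pair; the dot product of two rotated vectors then
    # depends only on the difference of their positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```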

3. KV Cache

To avoid repeatedly recomputing previous tokens, we cache their results.
In regular autoregressive inference, the keys and values of all previous tokens are recomputed at every step.
With a KV cache, only the newly generated token goes into the next round of computation; its key and value are simply appended to the cache.
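A toy sketch of one decoding step for a single attention head (the names, shapes, and dict-based cache are illustrative assumptions, not the Llama2 implementation):

```python
import torch

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """x_new: (1, d) embedding of the newly generated token.
    cache: dict with "K" and "V" tensors of shape (t, d) for previous tokens."""
    q = x_new @ W_q                                  # query only for the new token
    k = x_new @ W_k
    v = x_new @ W_v
    cache["K"] = torch.cat([cache["K"], k], dim=0)   # append instead of recomputing
    cache["V"] = torch.cat([cache["V"], v], dim=0)
    scores = (q @ cache["K"].T) / cache["K"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["V"]
```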

4. MQA

MQA (Multi-Query Attention) comes from a paper by Noam Shazeer, CEO of Character.ai, co-author of the Transformer paper, and inventor of multi-head attention.
The GPU computes too fast for the memory bandwidth to catch up, so attention becomes memory-bound.
This is NOT a problem for the vanilla transformer.
But for the multi-head attention used in practice, the memory-access ratio O(n/d + 1/b) may become the bottleneck.
MQA is the solution: it removes the head dimension h from K and V and keeps it only in Q, so all query heads share a single key/value head.
From the comparison below, you can see that grouped-query attention is the middle ground between multi-query and multi-head attention (see the sketch after this section).
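A sketch of how the projection shapes differ across the three variants (argument names are mine, not from any particular codebase):

```python
import torch.nn as nn

def attention_projections(d_model: int, n_heads: int, n_kv_heads: int, head_dim: int):
    """Projection layers parameterized by the number of query heads and K/V heads.

    MHA:  n_kv_heads == n_heads     (each query head has its own K/V head)
    GQA:  1 < n_kv_heads < n_heads  (query heads share K/V within groups)
    MQA:  n_kv_heads == 1           (all query heads share a single K/V head)
    """
    q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
    k_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # shrinks with fewer KV heads
    v_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
    return q_proj, k_proj, v_proj
```

Fewer K/V heads means a smaller KV cache and less memory traffic per decoding step, which is exactly the bottleneck described above.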

5. SwiGLU

This is from Noam Shazeer again.
SwiGLU is a gated feed-forward variant built on the Swish (SiLU) activation, which is itself a modification of the sigmoid: Swish(x) = x · sigmoid(x).
The most interesting part is the answer to why this method works: the paper attributes its success to divine benevolence.
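A minimal sketch of the SwiGLU feed-forward block (weight names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def swiglu_ffn(x, W, V, W2):
    """SwiGLU FFN: the Swish-activated branch xW gates the linear branch xV,
    and the result is projected back down by W2.
    W, V: (d_model, d_ff); W2: (d_ff, d_model)."""
    return (F.silu(x @ W) * (x @ V)) @ W2
```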
