Llama2 tricks
A good YouTube video explained several tricks applied in Llama2
Here are my study notes.
1. Layer normalization
Batch norm: normalized by columns (same feature, different data)
Layer norm: normalized by rows (same data, different features)
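A tiny sketch of the two normalization axes (my own toy example, not from the video), assuming the input is a (batch, features) matrix:

```python
import torch

x = torch.randn(4, 8)  # toy input: 4 samples (rows), 8 features (columns)

# Batch norm: statistics per feature, computed down the columns (across the batch)
bn = (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-5)

# Layer norm: statistics per sample, computed along the rows (across the features)
ln = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-5)
```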
It can help with Internal Covariate Shift: drastic changes in a layer's output lead to drastic changes in the input of the next layer
Re-scaling turns out to matter more than re-centering, which leads to RMSNorm
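Here is a minimal RMSNorm sketch in the spirit of Llama2's norm layer; the `eps` value and the learnable `weight` follow common implementations, not necessarily the exact Llama2 code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Re-scales by the root mean square of the features; no re-centering (no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```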
2. RoPE
RoPE (Rotary Position Embedding) builds on the idea of relative position encoding
Absolute position encodings are added to the token embeddings
Relative position encodings are instead added on the key side of the attention score, making it a ménage à trois of query, key, and relative position
RoPE was introduced in the RoFormer paper, by a Chinese company called Zhuiyi (追一)
When trying to find a formula that encodes relative position, the authors arrived at a rotation formula, and that is how it got its name
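A simplified sketch of applying a rotary embedding to a per-head query/key tensor (my own illustration, assuming an even head dimension; real implementations differ in how they pair up the dimensions):

```python
import torch

def rotary_embed(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate dimension pairs of x (shape: ..., seq, head_dim) by position-dependent angles."""
    half = x.shape[-1] // 2
    # One frequency per dimension pair, as in RoFormer: theta_i = base^(-2i / head_dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]                   # the two halves form the rotation pairs
    # A 2D rotation of each (x1, x2) pair; the q·k dot product then depends only on relative position
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# usage on query/key tensors of shape (seq, head_dim):
# q = rotary_embed(q, torch.arange(q.shape[-2]))
# k = rotary_embed(k, torch.arange(k.shape[-2]))
```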
3. KV Cache
To avoid repeated calculation for previous tokens, we cache the computed results (the keys and values)
In regular inference, previous tokens are recomputed at every decoding step
With a KV cache, only the newly generated token goes into the next round's calculation; earlier keys and values are read from the cache
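A toy KV-cache sketch (illustrative names, not Llama2's actual code):

```python
import torch

class KVCache:
    """Stores keys/values of already-processed tokens so each decode step only computes the new token's K/V."""
    def __init__(self):
        self.k = None  # (batch, heads, seq_so_far, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # Append this step's single-token K/V instead of recomputing the whole prefix
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# At step t, only the newest token's q/k/v is computed; attention reads the cached K/V:
# k_all, v_all = cache.update(k_t, v_t)   # k_t, v_t: (batch, heads, 1, head_dim)
# attn = torch.softmax(q_t @ k_all.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v_all
```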
4. MQA
MQA (Multi-Query Attention) comes from a paper by Noam Shazeer, the CEO of Character.ai, a co-author of the Transformer paper, and the inventor of multi-head attention.
The GPU's compute is too fast for the memory bandwidth to catch up
This is NOT a problem for vanilla transformers
But for multi-head transformers, which are used in practice, the ratio O(n/d + 1/b) may become the bottleneck (roughly: n is the sequence length, d the model dimension, and b the batch size)
MQA is the solution here: it removes the head dimension h from K and V, keeping it only in Q
From the comparison below, you can see that grouped-query attention sits between multi-query and multi-head attention.
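A shape-level sketch of the three variants (toy sizes, only the K projection shown; names are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads, n_kv_heads, head_dim = 512, 8, 2, 64  # toy sizes

# Multi-head attention: every head has its own K (and V) projection
mha_k = nn.Linear(d_model, n_heads * head_dim)

# Multi-query attention: all heads share a single K (and V); the h dimension is gone from K/V
mqa_k = nn.Linear(d_model, 1 * head_dim)

# Grouped-query attention: a middle ground with n_kv_heads shared K/V groups
gqa_k = nn.Linear(d_model, n_kv_heads * head_dim)

x = torch.randn(2, 16, d_model)  # (batch, seq, d_model)
print(mha_k(x).shape, mqa_k(x).shape, gqa_k(x).shape)
# torch.Size([2, 16, 512]) torch.Size([2, 16, 64]) torch.Size([2, 16, 128])
```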
5. SwiGLU
This is from Noam again
It's actually a gated activation function built on Swish, which itself is a modification of the sigmoid (Swish(x) = x · sigmoid(βx)).
Here is the most interesting result: why does this method work? The paper attributes it to "divine benevolence"
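For reference, a sketch of the SwiGLU feed-forward block as used in Llama-style models (hidden size and the no-bias choice are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block using SwiGLU: SiLU(x W1) gated elementwise by (x W3), then projected back by W2."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch (goes through SiLU/Swish)
        self.w3 = nn.Linear(dim, hidden, bias=False)  # linear branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```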