Eagle 1/2/3 + HASS

Speculative Decoding with Eagles

0 Medusa review

Zhihu explains that when Medusa builds its tree attention, the tree has $\sum_{i=1}^{N}\prod_{j=1}^{i} C_j$ branches ($N$ heads, with $C_j$ candidate tokens for head $j$). So pruning is critical for Medusa: with 4 heads and [10, 10, 9, 4] candidates per head, the number of paths drops from 4610 to 64.
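The 4610 figure can be checked with a few lines of Python. This is only an arithmetic sketch of the formula above; the function name `medusa_tree_nodes` is made up for illustration and is not part of any Medusa code.

```python
from math import prod

def medusa_tree_nodes(candidates_per_head):
    """Sum over depth i of the product C_1 * ... * C_i (full, unpruned tree)."""
    return sum(
        prod(candidates_per_head[:i])
        for i in range(1, len(candidates_per_head) + 1)
    )

full_tree = medusa_tree_nodes([10, 10, 9, 4])
print(full_tree)  # 10 + 100 + 900 + 3600 = 4610
```

Every extra head multiplies the deepest layer, so the count grows quickly, and that growth is why pruning down to 64 paths matters.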

1 Eagle 1

Instead of token-level decoding as in Medusa, Eagle uses feature-level decoding. Eagle takes both tokens and features as input, and also adds a causal dependence between draft steps. A figure from the Eagle paper shows the differences.

Eagle builds a static draft tree and runs multiple rounds of forward passes through its 1-layer transformer.
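A toy sketch of feature-level drafting, under heavy assumptions: a single linear map (`W_draft`) stands in for the 1-layer transformer, `W_head` stands in for the target model's LM head, and all weights and names here are invented for illustration. It only shows how each draft step consumes a (feature, token) pair and emits the next pair.

```python
def matvec(W, x):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def draft_chain(feature, token, embed, W_draft, W_head, steps):
    """Greedy chain draft: predict the next feature, then read a token from it."""
    drafted = []
    for _ in range(steps):
        x = feature + embed[token]        # concatenate feature and token embedding
        feature = matvec(W_draft, x)      # predicted next feature
        logits = matvec(W_head, feature)  # LM head applied to the predicted feature
        token = max(range(len(logits)), key=logits.__getitem__)
        drafted.append(token)
    return drafted

# Hypothetical toy sizes: hidden dim 2, vocabulary of 3 tokens.
embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_draft = [[0.5, 0.0, 0.0, 1.0], [0.0, 0.5, 1.0, 0.0]]
W_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
drafted = draft_chain([1.0, 0.0], 0, embed, W_draft, W_head, steps=3)
print(drafted)
```

A real draft tree would keep several candidate tokens per step rather than the single greedy one kept here.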

2 Eagle 2

Eagle2 changes the static draft tree into a dynamic one.

Expand phase: remove nodes whose probability is below a threshold.
Rerank phase: rerank all the remaining nodes and keep the top K.
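The two phases can be sketched as follows. This is a minimal illustration, not Eagle2's implementation: `next_probs` is a hypothetical stand-in for the draft model, and scoring a node by the product of probabilities along its path is an assumption made here.

```python
import heapq

def build_dynamic_tree(next_probs, depth, threshold, top_k):
    """next_probs(path) -> {token: prob}. Returns the kept (path, score) nodes."""
    frontier = [((), 1.0)]
    kept = []
    for _ in range(depth):
        new_frontier = []
        for path, score in frontier:
            for tok, p in next_probs(path).items():
                s = score * p
                if s >= threshold:  # expand phase: drop low-probability nodes
                    new_frontier.append((path + (tok,), s))
        kept.extend(new_frontier)
        frontier = new_frontier
    # Rerank phase: order all surviving nodes by score and keep the top K.
    return heapq.nlargest(top_k, kept, key=lambda node: node[1])

def toy_probs(path):
    # Hypothetical draft distribution, identical at every position.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

tree = build_dynamic_tree(toy_probs, depth=2, threshold=0.15, top_k=3)
print([path for path, _ in tree])  # [('a',), ('a', 'a'), ('b',)]
```

Because a child's score never exceeds its parent's, the top K nodes chosen this way still form a connected tree.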

3 HASS

HArmonized Speculative Sampling (HASS) aims to close the gap between the features used in training and those used at inference. Zhihu explains it and introduces another work, CORAL. The solution is multi-step training, which feeds the draft model's own output features back in as training inputs.
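A toy illustration of the idea, with loud simplifications: the draft model is a hypothetical one-parameter function (`draft_predict`), features are scalars, and the loss is a plain squared error. Step 1 sees target-model features as input; every later step sees only what the draft model produced in the previous step, which is the situation it faces at inference.

```python
def draft_predict(prev_feature, w):
    """Hypothetical draft model: scales the previous feature by w."""
    return w * prev_feature

def multi_step_losses(target_features, w, steps):
    """Mean squared error per training step (requires steps < len(target_features))."""
    inputs = list(target_features)  # step 1 inputs come from the target model
    losses = []
    for k in range(1, steps + 1):
        # preds[i] is the estimate of target_features[i + k]
        preds = [draft_predict(x, w) for x in inputs[: len(target_features) - k]]
        labels = target_features[k:]
        losses.append(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds))
        inputs = preds  # later steps reuse the draft model's own outputs
    return losses

features = [1.0, 2.0, 4.0, 8.0]
print(multi_step_losses(features, w=2.0, steps=2))  # perfect draft model: [0.0, 0.0]
print(multi_step_losses(features, w=1.5, steps=2))  # error grows at step 2
```

With an imperfect `w`, the second-step loss is larger than the first, which is the compounding error that one-step training never sees.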

4 Eagle 3

The Eagle3 paper is inspired by HASS, and Zhihu summarizes its improvements as follows:

Uses features from 3 layers.
Uses multi-step training similar to HASS.
Removes the Smooth L1 loss on features, which was used to narrow the train/inference feature gap. It is not needed with multi-step training.
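For the first item, one way to picture combining features from 3 layers is to concatenate them and project back to the hidden size with a single linear layer. The sketch below assumes exactly that; the weights in `W_fuse` and the toy feature values are invented for illustration.

```python
def fuse_features(low, mid, high, W_fuse):
    """Concatenate three d-dim features (length 3d) and project back to d dims."""
    concat = low + mid + high
    return [sum(w * x for w, x in zip(row, concat)) for row in W_fuse]

# Hypothetical toy case with d = 2: each output dim sums the matching input dims.
W_fuse = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
fused = fuse_features([3, 0], [6, 3], [9, 6], W_fuse)
print(fused)  # [18, 9]
```

In a trained model `W_fuse` would be learned, so the draft model can decide how much each layer's features contribute.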
