Mixture of Recursions
EZ’s talk about Mixture of Recursions
1 MoR = RNN + Transformers
DeepMind's latest paper on MoR evolves from Recursive Transformers and Mixture-of-Depths (MoD).
The idea is similar to MoE, where a router decides which path a token takes; here the router decides how many layers (recursion steps) each token goes through, so some tokens get more computation than others.
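A minimal sketch of this token-level routing idea (my own illustration, not the paper's code; the class name `MoRSketch`, the capacity fraction, and the sigmoid gating are assumptions): a single shared block is applied up to a maximum recursion depth, and at each step a router keeps only the top-scoring tokens for another pass while the rest flow through unchanged.

```python
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    """Illustrative Mixture-of-Recursions-style routing (not the paper's implementation)."""
    def __init__(self, d_model=64, n_heads=4, max_recursions=3, capacity=0.5):
        super().__init__()
        # One block whose weights are shared across every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.router = nn.Linear(d_model, 1)   # one score per token per step
        self.max_recursions = max_recursions
        self.capacity = capacity              # fraction of tokens that recurse again

    def forward(self, x):                     # x: (batch, seq, d_model)
        for _ in range(self.max_recursions):
            scores = self.router(x).squeeze(-1)        # (batch, seq)
            k = max(1, int(self.capacity * x.size(1)))
            keep = scores.topk(k, dim=-1).indices      # tokens that go through the block again
            mask = torch.zeros_like(scores).scatter(1, keep, 1.0).unsqueeze(-1)
            y = self.shared_block(x)                   # same weights every recursion
            # Selected tokens take the (gated) block output; others pass through unchanged.
            # A real implementation would gather only the selected tokens to save compute.
            x = mask * torch.sigmoid(scores).unsqueeze(-1) * y + (1 - mask) * x
        return x

print(MoRSketch()(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```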
The sections below cover the papers and techniques related to MoR.
2 Recursive Transformers
This line of work starts from Universal Transformers, which combines RNN-style recurrence with Transformers.
One of the latest developments here is also from Google. The recursion means adding loops over some layers, i.e. the same block of layers is applied repeatedly.
Relaxed Recursive Transformers then add the LoRA idea on top of this:
different LoRA adapters are attached to each loop, so the parameters differ slightly between recursion steps.
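A rough sketch of that idea, assuming a shared feed-forward block with one LoRA adapter per loop index (the class names `LoRALinear` / `RelaxedRecursiveFFN`, the rank, and the initialization are made up for illustration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A shared linear layer plus one low-rank (LoRA) delta per loop index."""
    def __init__(self, d_in, d_out, n_loops, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)    # shared across all recursions
        self.A = nn.ParameterList([nn.Parameter(torch.randn(d_in, rank) * 0.01)
                                   for _ in range(n_loops)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(rank, d_out))
                                   for _ in range(n_loops)])

    def forward(self, x, loop_idx):
        # Same base weights every recursion; only the cheap LoRA term depends on the loop.
        return self.base(x) + (x @ self.A[loop_idx]) @ self.B[loop_idx]

class RelaxedRecursiveFFN(nn.Module):
    """The block is looped n_loops times, each loop with its own LoRA adapter."""
    def __init__(self, d_model=64, n_loops=3):
        super().__init__()
        self.up = LoRALinear(d_model, 4 * d_model, n_loops)
        self.down = LoRALinear(4 * d_model, d_model, n_loops)
        self.n_loops = n_loops

    def forward(self, x):
        for i in range(self.n_loops):
            x = x + self.down(torch.relu(self.up(x, i)), i)   # residual per recursion
        return x

print(RelaxedRecursiveFFN()(torch.randn(2, 16, 64)).shape)    # torch.Size([2, 16, 64])
```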
How do you convert a vanilla Transformer into a Recursive Transformer? The paper discusses different ways to select which layers to keep and share.
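For illustration only, one plausible strategy is to initialize each shared (looped) layer from the pretrained stack, e.g. by averaging the weights of the consecutive layers that fold onto the same looped position (picking every k-th layer is another option); the helper below is hypothetical and is not the paper's exact recipe.

```python
import torch

def init_shared_layers(layer_state_dicts, n_shared):
    """Fold a deeper pretrained stack into n_shared looped layers by averaging
    the consecutive layers that map to the same looped position."""
    group = len(layer_state_dicts) // n_shared
    shared = []
    for i in range(n_shared):
        chunk = layer_state_dicts[i * group:(i + 1) * group]
        shared.append({name: torch.stack([sd[name] for sd in chunk]).mean(0)
                       for name in chunk[0]})
    return shared

# Toy check: 6 "layers" folded into 3 shared ones -> pairwise averages.
dummy = [{"w": torch.full((2, 2), float(i))} for i in range(6)]
print([sd["w"][0, 0].item() for sd in init_shared_layers(dummy, n_shared=3)])  # [0.5, 2.5, 4.5]
```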
Continuous depth-wise batching is also introduced here; it builds on continuous (sequence-level) batching, popularized by Anyscale. Yes, Cade's blog shows up here again.
- Model stage: refers to model layers (depth) here
- GPU bubbles are reduced by combining depth-wise batching with early exiting (see the toy scheduler after this list)
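A toy scheduler to make the idea concrete (slot-based and purely illustrative; the stage counts and request names are invented): each tick every occupied slot advances one model stage, and a slot freed by an early exit is refilled from the queue immediately instead of idling until the whole batch finishes.

```python
from collections import deque

def depthwise_schedule(requests, n_stages=4, batch_slots=2):
    """requests: {req_id: exit_stage}, i.e. how deep each request runs before early-exiting."""
    queue, slots, tick = deque(requests), {}, 0     # slots: slot index -> (req_id, stage)
    while queue or slots:
        # Refill any free slot with the next waiting request (no waiting for the batch).
        for s in range(batch_slots):
            if s not in slots and queue:
                slots[s] = (queue.popleft(), 0)
        print(f"tick {tick}: " + ", ".join(
            f"slot{s}={rid}@stage{st}" for s, (rid, st) in sorted(slots.items())))
        # Advance each active request one stage; drop it once it reaches its exit stage.
        for s, (rid, st) in list(slots.items()):
            st += 1
            if st >= min(requests[rid], n_stages):
                del slots[s]                        # early exit frees the slot this tick
            else:
                slots[s] = (rid, st)
        tick += 1

depthwise_schedule({"a": 4, "b": 2, "c": 4, "d": 3})
```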
3 Mixture-of-Depths
A router, similar to the one in MoE, is used to decide whether each token bypasses or computes a block.
Early exit can also skip some layer computation, but only the layers at the end of the stack.
MoD can skip layers at any position.
Two routing schemes are used here:
- Token-choice routing: each token chooses its expert, which easily leads to bad load balance
- Expert-choice routing: each expert chooses its tokens, which is balance-aware by construction (see the comparison sketch after this list)
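The difference between the two schemes boils down to which axis the selection runs over; a minimal sketch with random router scores (shapes and capacity are placeholders):

```python
import torch

def token_choice(scores):
    """Token-choice: each token picks its best expert (argmax over the expert axis).
    Per-expert load is whatever the tokens decide, so it can be badly imbalanced."""
    return scores.argmax(dim=-1)                          # (n_tokens,) expert id per token

def expert_choice(scores, capacity):
    """Expert-choice: each expert picks its top-`capacity` tokens (top-k over the token axis),
    so every expert processes exactly the same number of tokens."""
    return scores.topk(capacity, dim=0).indices           # (capacity, n_experts) token ids

scores = torch.randn(10, 4)                               # router logits: 10 tokens x 4 experts
print(torch.bincount(token_choice(scores), minlength=4))  # possibly imbalanced counts
print(expert_choice(scores, capacity=3).shape)            # torch.Size([3, 4]) -> balanced
```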
Expert-choice routing was originally introduced by Google as Expert Choice MoE.
So the con of token-choice is load imbalance, and the con of expert-choice is leakage (a causality violation):
during training an expert can always pick its top-k tokens over the whole sequence, but during inference only the already-generated tokens are available to choose from.
The solution is to add a simple MLP that predicts whether a token belongs to the top-k.
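A minimal sketch of that fix, assuming a tiny MLP trained with binary cross-entropy (sizes and names are placeholders): the expert-choice top-k over the full sequence supplies the training labels, and at inference each token is scored causally from its own hidden state.

```python
import torch
import torch.nn as nn

d_model, n_tokens, k = 64, 32, 8
router = nn.Linear(d_model, 1)                  # expert-choice router (sees the whole sequence)
predictor = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(n_tokens, d_model)              # toy token hidden states

# Training: the router's top-k over the full sequence gives ground-truth labels...
with torch.no_grad():
    scores = router(x).squeeze(-1)
    labels = torch.zeros(n_tokens)
    labels[scores.topk(k).indices] = 1.0        # 1 = token selected by expert-choice

# ...and the auxiliary MLP learns, per token, to predict that membership.
loss = nn.functional.binary_cross_entropy_with_logits(predictor(x).squeeze(-1), labels)
loss.backward()

# Inference: each new token decides from its own hidden state alone (causal),
# without the future tokens that a sequence-wide top-k would need.
selected = torch.sigmoid(predictor(x[:1])) > 0.5
print(round(loss.item(), 3), selected.item())
```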
4 Results
The approach only shows benefits for large models under a limited compute budget.