Mixture of Recursions

EZ’s talk on Mixture of Recursions (MoR)

1 MoR = RNN + Transformers

DeepMind’s latest paper introduces MoR, which evolves from Recursive Transformers and Mixture of Depths (MoD).

The idea is similar to MoE: a router decides which layers each token goes through, so some tokens pass through more layers than others.
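To make the routing idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `shared_block`, `route_depth`, `mor_forward`, and the router weights are all hypothetical stand-ins, and the router here simply buckets a sigmoid score into a recursion depth.

```python
import math

def shared_block(hidden):
    # Stand-in for one shared Transformer block (hypothetical toy update).
    return [0.5 * h + 1.0 for h in hidden]

def route_depth(hidden, router_w, max_depth):
    # Hypothetical router: squash a linear score to (0, 1), then bucket it
    # into a recursion depth between 1 and max_depth.
    score = sum(h * w for h, w in zip(hidden, router_w))
    gate = 1.0 / (1.0 + math.exp(-score))
    return min(max_depth, 1 + int(gate * max_depth))

def mor_forward(tokens, router_w, max_depth=3):
    outputs, depths = [], []
    for hidden in tokens:
        depth = route_depth(hidden, router_w, max_depth)
        for _ in range(depth):  # the same block is reused at every step
            hidden = shared_block(hidden)
        outputs.append(hidden)
        depths.append(depth)
    return outputs, depths

tokens = [[-5.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
outputs, depths = mor_forward(tokens, router_w=[1.0, 0.0])
print(depths)
```

With these toy inputs the three tokens get depths 1, 2, and 3, so the last token passes through the shared block three times while the first passes through it once.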

A number of papers cover techniques related to MoR; the next sections go through Recursive Transformers and Mixture of Depths.

2 Recursive Transformers

It starts from Universal Transformers, which combine RNN-style recurrence with Transformers.

One recent development of this technique also comes from Google. The recursive approach adds loops over some of the layers, and Relaxed Recursive Transformers add the LoRA idea on top: a different LoRA is attached to each recursive block, so the parameters differ slightly between blocks.
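The "shared weights plus a different LoRA per loop" idea can be sketched with small matrices. This is an illustrative toy under stated assumptions, not the paper's code: the block is reduced to a single linear map, and `relaxed_recursive_forward` and the example matrices are hypothetical.

```python
def matmul(a, b):
    # Plain-Python matrix product of a (n x k) and b (k x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relaxed_recursive_forward(x, shared_w, loras):
    # loras holds one (A, B) pair per recursion: A is d x r, B is r x d.
    for a, b in loras:
        w = add(shared_w, matmul(a, b))  # shared weights plus a low-rank delta
        x = matvec(w, x)
    return x

shared_w = [[1, 0], [0, 1]]            # one weight matrix reused by every loop
loras = [
    ([[0], [0]], [[0, 0]]),            # loop 1: zero delta, plain shared weights
    ([[1], [0]], [[0, 1]]),            # loop 2: rank-1 delta changes the weights
]
result = relaxed_recursive_forward([1, 2], shared_w, loras)
print(result)
```

Only the low-rank pairs differ between loops; the full matrix `shared_w` is stored once. In this toy run the second loop's delta turns the identity into `[[1, 1], [0, 1]]`, so the input `[1, 2]` comes out as `[3, 2]`.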

How do you convert a Transformer into a Recursive Transformer (RT)? The paper discusses different ways to select the layers.

Continuous depth-wise batching is also introduced here. It builds on the continuous sequence-wise batching covered in the Anyscale blog. Yes, Cade’s blog shows up here again.

  • Model stage: refers to the model's layers here
  • GPU bubbles are reduced by depth-wise batching and early exiting
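A rough way to see why this reduces bubbles is to count block invocations. The simulation below is only a sketch under simplifying assumptions: each request needs a known number of passes through the shared block (its exit depth), one pass of a batch costs one step, and the function names are hypothetical.

```python
from collections import deque

def static_steps(depths, batch_size):
    # Static batching: each batch runs until its deepest request is done,
    # so slots freed by early exits sit idle (the "bubbles").
    steps = 0
    for i in range(0, len(depths), batch_size):
        steps += max(depths[i:i + batch_size])
    return steps

def continuous_steps(depths, batch_size):
    # Continuous depth-wise batching: a freed slot is refilled right away.
    # This is possible because every depth runs the same shared block.
    queue = deque(depths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one pass left exit now; the rest need one pass fewer.
        active = [d - 1 for d in active if d > 1]
    return steps

depths = [3, 1, 1, 1]  # passes each request needs before it exits
print(static_steps(depths, batch_size=2), continuous_steps(depths, batch_size=2))
```

For this toy workload static batching needs 4 steps and continuous depth-wise batching needs 3, because the slot freed by each early exit is immediately reused.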

3 Mixture of Depths

A router, similar to the one in MoE, decides whether a token bypasses a block or is computed by it. Early exit can also skip layer computation, but only for the final layers. MoD can skip layers at any position.
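The bypass-or-compute decision can be sketched as follows. This is a toy illustration, not the paper's code: `mod_block`, the router weights, and `block_fn` are hypothetical, and scaling the block output by the router score is an assumption about how the router stays on the compute path.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mod_block(tokens, router_w, capacity, block_fn):
    # Score every token, let only the top-`capacity` tokens use the block,
    # and pass the rest through unchanged on the residual path.
    scores = [dot(t, router_w) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:capacity])
    out = []
    for i, t in enumerate(tokens):
        if i in selected:
            update = block_fn(t)
            # Residual connection, with the update scaled by the router score.
            out.append([x + scores[i] * u for x, u in zip(t, update)])
        else:
            out.append(list(t))  # bypass: no block computation for this token
    return out, sorted(selected)

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
out, selected = mod_block(tokens, router_w=[1.0, 0.0], capacity=1,
                          block_fn=lambda t: [1.0 for _ in t])
print(selected, out)
```

Because the decision is made per block, the same token can be skipped in one block and computed in the next, which is what separates this from early exit.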

Two routing schemes are used here:

  • Token-choice routing: each token chooses its expert, which easily leads to poor load balance
  • Expert-choice routing: each expert chooses its tokens, which keeps the load balanced
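The difference in load balance shows up even in a tiny example. The sketch below assumes a hypothetical score matrix `scores[token][expert]`; the function names are illustrative only.

```python
def token_choice(scores):
    # Each token picks its highest-scoring expert; returns tokens per expert.
    num_experts = len(scores[0])
    loads = [0] * num_experts
    for row in scores:
        best = max(range(num_experts), key=lambda e: row[e])
        loads[best] += 1
    return loads

def expert_choice(scores, capacity):
    # Each expert picks its top-`capacity` tokens; returns token ids per expert.
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks

# Every token prefers expert 0, which is the bad case for token choice.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(token_choice(scores))
print(expert_choice(scores, capacity=2))
```

With token choice all four tokens land on expert 0 and expert 1 stays idle. With expert choice each expert takes exactly two tokens, at the cost that the selection for one expert depends on the scores of every token in the sequence.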

Expert-choice routing was actually introduced by Google as Expert Choice MoE.

Each scheme has a drawback. Token-choice routing suffers from load imbalance. Expert-choice routing suffers from leakage (a causality violation): during training an expert can pick its top-k tokens from the whole sequence, but during inference only the already-generated tokens are available to choose from.

The solution is to add a simple MLP that predicts whether a token belongs to the top-k.
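The leakage and the fix can be illustrated with a toy sequence. Everything here is a sketch under assumptions: `tiny_mlp` uses hand-set, untrained weights purely for illustration, whereas in practice the predictor would be trained to match the top-k decisions.

```python
import math

def topk_selected(scores, k):
    # Expert-choice style selection: needs the scores of the whole sequence.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def tiny_mlp(x, w1, b1, w2, b2):
    # One hidden ReLU layer, sigmoid output: an estimate of P(token is in top-k).
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def predictor_selected(tokens, params, threshold=0.5):
    # Each decision uses only the token's own features, so it is causal.
    return {i for i, x in enumerate(tokens) if tiny_mlp(x, *params) > threshold}

tokens = [[-1.0], [1.0], [3.0]]
scores = [t[0] for t in tokens]              # toy router score per token
params = ([[1.0]], [0.0], [2.0], -1.0)       # hypothetical w1, b1, w2, b2

prefix_topk = topk_selected(scores[:2], k=1)   # decision before token 2 exists
full_topk = topk_selected(scores, k=1)         # decision once token 2 arrives
prefix_pred = predictor_selected(tokens[:2], params)
full_pred = predictor_selected(tokens, params)
print(prefix_topk, full_topk, prefix_pred, full_pred)
```

Token 1 is selected by top-k on the two-token prefix but dropped once token 2 arrives, so its routing depended on a future token. The predictor's decision for token 1 is the same in both cases, which is what makes it usable during autoregressive inference.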

4 Result

The benefit only shows up for large models under a limited compute budget.
