SGLang

How does SGLang (Structured Generation Language for LLMs) achieve such strong performance, and how does it differ from vLLM?

Lianmin presented SGLang here and explains 4 key techniques.

0 History

The talk covers SGLang's history, milestones, and key architecture; these notes focus on the server side.

1 LM Programs

SGLang was designed around two defining properties of LM programs:

  1. They make multiple LLM calls.
  2. They need constrained decoding.


SGLang has its own language primitives, similar in spirit to LMQL and Guidance.

A full example using SGLang exercises 3 key improvements (the 3rd one applies to API calls only), covered in the sections below.
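As a rough illustration of the primitives, here is a minimal sketch based on SGLang's published frontend API (the server address is an assumption, and argument names may differ across versions):

```python
import sglang as sgl

@sgl.function
def tool_use(s, question):
    # One program, several dependent LLM calls.
    s += "Question: " + question + "\n"
    # Constrained decoding: the model must pick one of the listed options.
    s += "Tool: " + sgl.gen("tool", choices=["calculator", "search engine"]) + "\n"
    s += "Answer: " + sgl.gen("answer", max_tokens=64)

# Assumes a locally launched SGLang server on the default port.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = tool_use.run(question="What is 2 to the power of 10?")
print(state["tool"], state["answer"])
```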

2 RadixAttention

The KV cache is highly reusable across requests (shared system prompts, few-shot examples, multi-turn history). A radix tree is a compressed prefix tree; RadixAttention uses one to index the KV cache so that a shared prefix is computed once and reused.
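A toy sketch of the core idea (illustration only, not SGLang's implementation; the real structure compresses edges and evicts leaves with an LRU policy):

```python
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.kv = None      # hypothetical handle to this prefix's KV blocks

class RadixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens, kv_handles):
        """Record the KV block handle for every prefix of `tokens`."""
        node = self.root
        for t, kv in zip(tokens, kv_handles):
            node = node.children.setdefault(t, Node())
            node.kv = kv

cache = RadixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
print(cache.match_prefix([1, 2, 3, 9]))  # 3: only the 4th token needs prefill
```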

3 Compressed FSM

The FSM approach from Outlines decodes one token at a time. Guidance improves on this with interleave-based decoding: fixed parts of the template are filled in directly, and the model only decodes the gaps.

Jump-forward decoding uses a compressed FSM that combines the two methods: whenever the FSM allows only one possible continuation, the decoder jumps forward over that deterministic stretch instead of decoding it token by token.
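A toy sketch of the control flow (illustration only; the FSM is over single characters for brevity, and `pick_token` is a hypothetical stand-in for a constrained model call):

```python
def jump_forward_decode(start, finals, table, pick_token):
    state, out = start, ""
    while state not in finals:
        moves = table[state]  # {char: next_state}
        if len(moves) == 1:
            # Deterministic stretch: emit it with no model forward pass.
            ch, state = next(iter(moves.items()))
        else:
            # Branch point: the model chooses among the allowed characters.
            ch = pick_token(out, allowed=set(moves))
            state = moves[ch]
        out += ch
    return out

# FSM accepting "yes" or "no"; after the first character the rest is forced.
table = {0: {"y": 1, "n": 2}, 1: {"e": 3}, 3: {"s": 4}, 2: {"o": 4}}
greedy = lambda out, allowed: sorted(allowed)[0]  # dummy model choice
print(jump_forward_decode(0, {4}, table, greedy))  # -> "no", one model call
```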

4 Speculative API calls

This is a simple prompting trick: ignore the stop token and let the API generate extra content, which later calls in the program may be able to reuse.
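A rough sketch of the idea (illustration only; `llm` below is a hypothetical stand-in for an API client; if I recall correctly, SGLang's OpenAI backend exposes this as speculative execution via a `num_api_spec_tokens` option):

```python
def speculative_calls(llm, prompt):
    # A template with two fields normally costs two API calls. Instead, ask
    # for extra tokens on the first call, ignoring its natural stop point.
    text = llm(prompt + "Name:", max_tokens=64)
    name, _, rest = text.partition("\n")
    if rest.startswith("Job:"):
        # The surplus text matched the next template piece: second call saved.
        job = rest[len("Job:"):].split("\n")[0]
    else:
        job = llm(prompt + "Name:" + name + "\nJob:", max_tokens=16)
    return name.strip(), job.strip()

fake_llm = lambda prompt, max_tokens: " Alice\nJob: engineer\nHobby: chess"
print(speculative_calls(fake_llm, "Profile of a person.\n"))  # one call only
```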

PyTorch compile is another thing Lianmin mentioned. The following 3 optimizations are not included in the original paper; details are in this video by Yineng Zhang.

5 CPU Overlap Optimization

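As I understand it, the idea is to overlap CPU-side work (batch scheduling, detokenization) with GPU execution, so the GPU never sits idle waiting for the Python scheduler. A toy sketch of the pattern (illustration only, not SGLang's scheduler):

```python
import queue, threading, time

def scheduler(batches, q):
    for b in batches:
        time.sleep(0.01)  # stand-in for CPU scheduling/tokenization work
        q.put(b)          # hand the prepared batch to the GPU worker
    q.put(None)

def gpu_worker(q):
    while q.get() is not None:
        time.sleep(0.03)  # stand-in for the GPU forward pass

q = queue.Queue(maxsize=2)  # CPU runs at most 2 batches ahead of the GPU
t = threading.Thread(target=scheduler, args=(range(10), q))
t.start()
gpu_worker(q)   # CPU prep of batch N+1 overlaps GPU compute of batch N
t.join()
```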

6 FlashInfer Hopper Optimization and Integration

FlashInfer gives better performance than the Triton implementation. One of the key improvements is Stream-K, an SM-level optimization: instead of assigning whole output tiles to SMs, it splits the combined K-loop work evenly across SMs, which keeps them all busy when the tile count does not divide evenly.
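A toy sketch of the Stream-K partitioning idea (illustration only; SMs that share a tile combine their partial results in a later fix-up step):

```python
def stream_k_partition(num_tiles, iters_per_tile, num_sms):
    total = num_tiles * iters_per_tile
    for sm in range(num_sms):
        lo = sm * total // num_sms
        hi = (sm + 1) * total // num_sms
        tiles = sorted({i // iters_per_tile for i in range(lo, hi)})
        print(f"SM {sm}: iterations [{lo}, {hi}) covering tiles {tiles}")

# 5 tiles x 8 K-iterations on 4 SMs: each SM gets exactly 10 iterations,
# while tile-per-SM scheduling would give one SM two tiles (2x the work).
stream_k_partition(num_tiles=5, iters_per_tile=8, num_sms=4)
```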

7 TurboMind GEMM Optimization and Integration

LDSM is the CUDA instruction for "Load Matrix from Shared Memory with Element Size Expansion". Improvements at the GEMM (GEneral Matrix-to-Matrix multiplication) level are beyond my knowledge.
