SGLang
How does SGLang (Structured Generation Language for LLMs) achieve such strong performance, and how does it differ from vLLM?
Lianmin presented SGLang here and explained four key techniques.
0 History
SGLang history and milestones.
The key architecture is shown here; I will focus on the server side.
1 LM Programs
SGLang was designed to address the constraints of LM programs:
- They involve multiple LLM calls
- They need constrained decoding
SGLang has its own language primitives, similar in spirit to LMQL and Guidance.
Here is a full example using SGLang; it includes three key improvements (the third one applies to API calls only).
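To make the "LM program" idea concrete, here is a minimal sketch in the style of SGLang's official frontend examples (sgl.function, sgl.gen, program state access). The endpoint URL, model setup, and prompt wording are my own assumptions, not from the talk.

```python
import sglang as sgl

# Assumes a local SGLang server; adjust the URL to your deployment.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def tool_use(s, question):
    # An LM program: multiple dependent generation calls plus Python control flow.
    s += "To answer this question: " + question + ", "
    # Constrained decoding: the model must pick exactly one of the listed tools.
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
    if s["tool"] == "calculator":
        s += "The math expression is " + sgl.gen("expression", max_tokens=32, stop="\n")
    else:
        s += "The website url is " + sgl.gen("url", max_tokens=32, stop="\n")

state = tool_use.run(question="What is 2 to the power of 10?")
print(state["tool"])
print(state.text())
```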
2 RadixAttention
The KV cache is highly reusable: many requests share prefixes such as system prompts, few-shot examples, and chat history.
A radix tree is a compressed prefix tree; SGLang uses it to index the KV cache so that shared prefixes can be reused.
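As a toy illustration (my own sketch, not SGLang's implementation), a radix tree can map token-id prefixes to cached KV blocks so that a new request reuses the longest matching prefix. Real RadixAttention also handles edge splitting, LRU eviction, and scheduling, which are omitted here.

```python
# Toy radix-tree prefix cache, illustrative only (not SGLang's actual code).
# Keys are token-id sequences; values stand in for KV-cache handles.

class Node:
    def __init__(self):
        self.children = {}     # first token of an edge -> (edge_tokens, child Node)
        self.kv_handle = None  # placeholder for the KV cache of the prefix ending here

class RadixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return the longest cached prefix of `tokens` (the part whose KV can be reused)."""
        node, matched = self.root, []
        while tokens:
            entry = node.children.get(tokens[0])
            if entry is None:
                break
            edge, child = entry
            common = 0
            while common < len(edge) and common < len(tokens) and edge[common] == tokens[common]:
                common += 1
            matched.extend(tokens[:common])
            if common < len(edge):   # partial edge match: stop here
                break
            tokens = tokens[common:]
            node = child
        return matched

    def insert(self, tokens, kv_handle):
        """Insert a full token sequence and attach its KV handle."""
        node, remaining = self.root, list(tokens)
        while remaining:
            entry = node.children.get(remaining[0])
            if entry is None:
                child = Node()
                node.children[remaining[0]] = (remaining, child)
                node, remaining = child, []
            elif remaining[:len(entry[0])] == entry[0]:
                remaining = remaining[len(entry[0]):]
                node = entry[1]
            else:
                # A real implementation splits the edge here; omitted in this sketch.
                raise NotImplementedError("edge splitting omitted in this sketch")
        node.kv_handle = kv_handle

cache = RadixCache()
cache.insert([1, 2, 3, 4], kv_handle="kv-A")   # e.g. a shared system prompt
print(cache.match_prefix([1, 2, 3, 9, 9]))     # -> [1, 2, 3]: reuse 3 tokens of KV
```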
3 Compressed FSM
The FSM approach from Outlines decodes one token at a time. Guidance improves on this with interleaved-based decoding.
Jump-forward decoding uses a compressed FSM that combines the two methods: when the FSM has only one possible next path (for example, the fixed keys and punctuation of a JSON schema), those characters are appended in one jump instead of being decoded token by token.
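A rough sketch of the idea (my own illustration, not the SGLang implementation): in a character-level FSM, any state with a single outgoing edge is deterministic, so its text can be jump-forwarded; the model is only consulted at branching states, where logits are masked to the allowed continuations.

```python
# Toy character-level FSM for the pattern ab(c|d)ef, illustrative only.
# A state with exactly one outgoing edge is deterministic, so its text can be
# appended in one shot (jump-forward) instead of decoded token by token.

FSM = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"c": 3, "d": 3},    # branching state: the model must choose here
    3: {"e": 4},
    4: {"f": 5},
    5: {},                  # accepting state
}

def jump_forward(state):
    """Follow deterministic edges and return the forced text plus the new state."""
    text = ""
    while len(FSM[state]) == 1:
        ch, state = next(iter(FSM[state].items()))
        text += ch
    return text, state

def constrained_choice(state):
    """Stand-in for one constrained decoding step: a real system masks the
    model's logits to the allowed characters and samples one of them."""
    allowed = sorted(FSM[state])
    ch = allowed[0]          # placeholder: pretend the model picked this
    return ch, FSM[state][ch]

output, state = "", 0
while FSM[state]:
    forced, state = jump_forward(state)        # zero model calls for "ab" and "ef"
    output += forced
    if FSM[state]:
        ch, state = constrained_choice(state)  # one model call at the branch
        output += ch

print(output)  # "abcef": only one decoding step was needed instead of five
```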
4 Speculative API calls
This is a simple trick: ignore the stop token and let the API generate extra content, which later calls in the program may reuse through careful prompting.
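A rough sketch of the idea, with a hypothetical call_api stand-in instead of a real API client; the prompt format and parsing are my own illustration. The first call is allowed to run past its stop condition, and the extra text is matched against the next field so a second round trip can often be skipped.

```python
# Sketch of speculative execution for API-backed models, illustrative only.
# An LM program wants two fields (name, then job) as two separate gen() calls.

def call_api(prompt, max_tokens):
    """Hypothetical stand-in for a chat-completion API call."""
    return "Alice\nJob: Engineer\nHobby: chess"   # canned response for the demo

def speculative_two_fields(context):
    # First call: ask for the name, but request more tokens than needed and
    # ignore the newline stop so the model may keep going on its own.
    text = call_api(context + "Name:", max_tokens=64)
    name, _, rest = text.partition("\n")

    # Second "call": if the speculative tail already contains the next field,
    # reuse it; otherwise fall back to a real API call.
    if rest.startswith("Job:"):
        job = rest[len("Job:"):].split("\n")[0].strip()
    else:
        job = call_api(context + f"Name: {name}\nJob:", max_tokens=16).strip()
    return name.strip(), job

print(speculative_two_fields("Describe a person.\n"))  # ('Alice', 'Engineer')
```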
torch.compile was also mentioned by Lianmin. The following three optimizations are not included in the original paper; details can be found in this video by Yineng Zhang.
5 CPU Overlap Optimization
6 FlashInfer Hopper Optimization and Integration
FlashInfer gives better performance than the Triton implementation.
Key improvements from it include Stream-K, an SM-level scheduling optimization.
7 TurboMind GEMM Optimization and Integration
LDSM is the GPU instruction for "Load Matrix from Shared Memory with Element Size Expansion" (exposed in PTX as ldmatrix).
Improvements on GEMM (GEneral Matrix-to-Matrix multiplication) are beyond my knowledge.