TensorRT-LLM Backend


Watched Faradawn Yang’s talk about TRTLLM in the EZ channel, which is really well articulated. I further watched his vLLM and SGLang talks, which were also inspiring.

0 Overall view of these three backends

These three engines are actually optimizing inference from different perspectives.

1 How TRTLLM stands out

  1. Loading HF weights with vLLM works like a Python interpreter: the weights are loaded and executed eagerly. TRTLLM instead works like g++: it compiles an engine ahead of time for the target hardware type and executes it later (see the first sketch after this list).
  2. TRTLLM uses kernel auto-tuning, which searches over different matrix sizes for multiplication and keeps the fastest kernel for each. The challenge is the huge search space, but it can be mitigated by limiting the set of batch sizes (see the second sketch below).
  3. When the exact batch size is not known in advance, a request queue is the solution: requests wait in the queue and are batched into one of the pre-tuned sizes (see the third sketch below).
  4. It also comes with multiple other compile options.
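To make the interpreter-vs-compiler contrast concrete, here is a minimal sketch. The vLLM call is its real high-level API; the `tensorrt_llm.LLM` import mirrors the LLM API that newer TensorRT-LLM releases ship, but exact details vary by release, and the model name is just a placeholder.

```python
from vllm import LLM, SamplingParams

# vLLM: "interpreter" style -- HF weights are loaded and run directly.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))

# TensorRT-LLM: "compiler" style -- an engine is first built for this exact
# GPU (weight conversion + engine build), then the prebuilt engine is run.
# Sketch only; API details vary across TensorRT-LLM releases.
from tensorrt_llm import LLM as TrtLLM

trt_llm = TrtLLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # builds engine
trt_outputs = trt_llm.generate(["Hello, my name is"])
```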
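A toy version of kernel auto-tuning (not TensorRT's real tactic profiler): time a few candidate implementations per problem shape, keep the fastest, and only tune the handful of batch sizes the engine will actually see. That restriction is exactly what keeps the search space manageable. All names here are hypothetical.

```python
import time
import torch

# Hypothetical candidate "tactics" standing in for real GEMM kernels.
TACTICS = {
    "fp32": lambda a, b: a @ b,
    "fp16": lambda a, b: (a.half() @ b.half()).float(),
}

def tune_shape(m: int, n: int, k: int, iters: int = 20) -> str:
    """Benchmark each tactic on one (m, k) x (k, n) matmul; return the fastest."""
    a = torch.randn(m, k, device="cuda")
    b = torch.randn(k, n, device="cuda")
    best, best_t = None, float("inf")
    for name, fn in TACTICS.items():
        fn(a, b)  # warm-up
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(a, b)
        torch.cuda.synchronize()
        t = (time.perf_counter() - t0) / iters
        if t < best_t:
            best, best_t = name, t
    return best

# Tuning only a few batch sizes keeps the search space small.
TUNED = {bs: tune_shape(bs, 4096, 4096) for bs in (1, 8, 32)}
print(TUNED)
```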
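And a sketch of the queue idea: requests arrive with unpredictable timing, so they sit in a queue and are drained into one of the pre-tuned batch sizes. The bucketing policy here (round down to the nearest tuned size and re-queue the rest) is my assumption; a real server would more likely pad up to a bucket.

```python
import queue

requests: "queue.Queue[str]" = queue.Queue()
TUNED_BATCH_SIZES = (1, 8, 32)  # sizes the engine was built/tuned for

def next_batch() -> list[str]:
    """Drain up to the largest tuned size, then round down to a tuned
    batch size so the batch always hits a pre-tuned kernel."""
    items = []
    while len(items) < max(TUNED_BATCH_SIZES) and not requests.empty():
        items.append(requests.get())
    size = max((b for b in TUNED_BATCH_SIZES if b <= len(items)), default=0)
    batch, rest = items[:size], items[size:]
    for r in rest:
        requests.put(r)  # leftovers wait for the next batch
    return batch
```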

2 CUDA Graph Capture

I have heard this term mentioned multiple times and am still not sure what it means exactly. Here is a simple comparison; I will dive deeper later.
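A minimal PyTorch sketch of the difference: in eager mode every kernel is launched from Python one by one, while with `torch.cuda.CUDAGraph` the whole launch sequence is recorded once and replayed as a single unit, removing per-kernel launch overhead. The model and shapes are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.randn(8, 1024, device="cuda")

# Eager: each forward pass launches its kernels from Python every time.
eager_out = model(static_in)

# Warm-up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture: record the kernel launch sequence once into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay: copy new data into the captured input buffer and relaunch the
# whole recorded sequence with a single call.
static_in.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
print(static_out.norm())
```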
