TensorRT-LLM

To understand NIM, you cannot avoid a deep understanding of TRT-LLM, Triton, and even vLLM. Those will be the focus for the near future.

This is more of a note on how to host an LLM on Triton with the TensorRT-LLM backend.

1 Build the Container

The workspace is a container built for Triton Server. The instructions are for the TensorRT-LLM backend.

  1. Build the TensorRT-LLM backend from source
    -> This only builds the TRT-LLM backend itself.
  2. Build the Docker container (see the sketch after this list)
    Option 1: Build the NGC Triton TRT-LLM container
    -> This builds exactly the same container as the one published on NGC.
    Option 2: Build via Docker
    -> I actually built the container with this method.
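
A minimal sketch of Option 2, following the tensorrtllm_backend repo's README; the Dockerfile path may differ across versions, and the image tag here is arbitrary:

# Clone the backend repo and pull in the TensorRT-LLM submodule
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive

# Option 2: build the Triton + TRT-LLM image via Docker
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
  -f dockerfile/Dockerfile.trt_llm_backend .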

2 Build the engine

The engine is built with the trtllm-build command; multiple examples can be found in the examples/ directory of the TensorRT-LLM repository.
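
A minimal sketch of the two-step flow, using the Llama example script; the model path is a placeholder:

# Step 1: convert an HF checkpoint into the TRT-LLM checkpoint format
python3 examples/llama/convert_checkpoint.py \
  --model_dir ./Llama-2-7b-hf \
  --output_dir ./ckpt \
  --dtype float16

# Step 2: compile the converted checkpoint into a TensorRT engine
trtllm-build \
  --checkpoint_dir ./ckpt \
  --output_dir ./engine \
  --gemm_plugin float16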

3 Host with Triton

This Llama example shows the whole process, including engine building. The Triton hosting part is about creating proper config.pbtxt files for each of the following processing steps (see the sketch after this list):

  1. preprocessing
  2. postprocessing
  3. tensorrt_llm_bls or ensemble
    • Choose one of the two; it orchestrates pre/postprocessing around the engine
  4. tensorrt_llm
    • This is the standalone engine model
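
A sketch of how those config.pbtxt files are usually produced, starting from the template model repo and fill_template.py shipped with tensorrtllm_backend; the directory layout and template key names are assumptions that vary across backend versions:

# Start from the template model repo shipped with the backend
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm triton_model_repo

# Fill the placeholders in each config.pbtxt, e.g. the tokenizer for
# preprocessing (key names are assumptions; check your backend version)
python3 tensorrtllm_backend/tools/fill_template.py -i \
  triton_model_repo/preprocessing/config.pbtxt \
  tokenizer_dir:/models/tokenizer,triton_max_batch_size:128,preprocessing_instance_count:1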

4 Endpoint test

# Choose one model
MODEL_NAME='ensemble'
# MODEL_NAME='tensorrt_llm_bls'

curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate \
  -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": ""
    }'
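
For token-by-token output, Triton also exposes a generate_stream endpoint that returns server-sent events; a sketch, assuming the model was deployed with decoupled mode enabled in its config.pbtxt:

# Streaming variant (requires decoupled mode in the model config)
curl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate_stream \
  -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": ""
    }'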

5 Real examples

MODEL_CHECKPOINT="mixtral-8x7b-instruct-v01_vhf-a60832c-b"
OUTPUT_MODEL="MRG-Mistral8x7"

CONVERTED_CHECKPOINT="${MODEL_CHECKPOINT}-converted"
TOKENIZER=${MODEL_CHECKPOINT}


DTYPE=float16
TP=2
PP=1
MAX_LEN=13000
MAX_BATCH=128
MAX_LORA_RANK=32


SUFFIX=trtllm_engine
ENGINE=${OUTPUT_MODEL}/${SUFFIX}

# Convert the HF checkpoint to TRT-LLM format with 2-way tensor parallelism
python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${MODEL_CHECKPOINT} \
  --output_dir ${CONVERTED_CHECKPOINT} \
  --dtype ${DTYPE} \
  --tp_size ${TP} \
  --pp_size ${PP} \
  --moe_tp_size ${TP}

# Build a LoRA-enabled engine; the LoRA flags below are an assumption based
# on the trtllm-build options in recent TRT-LLM releases
trtllm-build \
  --checkpoint_dir ${CONVERTED_CHECKPOINT} \
  --output_dir ${ENGINE} \
  --max_num_tokens ${MAX_LEN} \
  --max_batch_size ${MAX_BATCH} \
  --lora_plugin ${DTYPE} \
  --max_lora_rank ${MAX_LORA_RANK} \
  --use_paged_context_fmha enable \
  --gemm_plugin ${DTYPE}
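
To serve this engine, point the tensorrt_llm model's config.pbtxt at it and launch Triton with a world size matching TP; the template key names are again assumptions that vary by backend version:

# Point the tensorrt_llm model at the engine built above
python3 tensorrtllm_backend/tools/fill_template.py -i \
  triton_model_repo/tensorrt_llm/config.pbtxt \
  triton_backend:tensorrtllm,engine_dir:${ENGINE},triton_max_batch_size:${MAX_BATCH},decoupled_mode:False,batching_strategy:inflight_fused_batching

# A TP=2 engine needs two server ranks; this helper wraps mpirun
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size ${TP} --model_repo triton_model_repo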
