Quantization


Quantization with TRT-LLM is done by building a custom engine from a quantized checkpoint. You can get INT8 on A100 and FP8 on H100. The quantization step below replaces convert_checkpoint.py:

python3 /app/tensorrt_llm/examples/quantization/quantize.py --model_dir ${MODEL_CHECKPOINT} \
  --output_dir ${CONVERTED_CHECKPOINT} \
  --dtype ${DTYPE} \
  --tp_size ${TP} \
  --qformat int8_sq \
  --kv_cache_dtype int8 \
  --calib_size 512
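
The command above is the INT8 (A100) case. For FP8 on H100, the same script is typically called with `--qformat fp8 --kv_cache_dtype fp8`. A minimal sketch of switching between the two is shown here; `quant_flags` and `GPU_NAME` are hypothetical names that are not part of TRT-LLM, and the GPU-to-format mapping only covers the two cards mentioned in this post.

```shell
# Hypothetical helper (not part of TRT-LLM): map a GPU model string
# to the quantization flags discussed in this post.
quant_flags() {
  case "$1" in
    *H100*) echo "--qformat fp8 --kv_cache_dtype fp8" ;;
    *A100*) echo "--qformat int8_sq --kv_cache_dtype int8" ;;
    *) echo "unsupported GPU: $1" >&2; return 1 ;;
  esac
}

# Usage sketch. GPU_NAME is an assumed variable, e.g. filled from:
#   GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
# The result is then appended, unquoted, in place of the --qformat and
# --kv_cache_dtype lines of the quantize.py command:
#   $(quant_flags "$GPU_NAME")
```

Keeping the mapping in one place makes it harder to build an FP8 checkpoint for a card that cannot run it.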

The engine build then follows the same steps as for an unquantized checkpoint:

trtllm-build --checkpoint_dir ${CONVERTED_CHECKPOINT} \
             --output_dir ${ENGINE}  \
             --max_batch_size ${MAX_BATCH} \
             --max_num_tokens ${MAX_LEN} \
             --gemm_plugin auto \
             --workers 2
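
Both commands rely on shell variables such as `${CONVERTED_CHECKPOINT}` and `${TP}`, and an empty one silently produces a malformed command line. A small pre-flight check, sketched here with a hypothetical `require_vars` helper (not part of TRT-LLM), can catch that before a long calibration or build starts.

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
  missing=""
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing variables:$missing" >&2
    return 1
  fi
}

# Example: check everything the two commands reference before running them.
# require_vars MODEL_CHECKPOINT CONVERTED_CHECKPOINT DTYPE TP ENGINE MAX_BATCH MAX_LEN
```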
