NVFP4 engine building
Got access to a B200 for the first time and worked on building an NVFP4 TRTLLM engine and benchmarking it against the original BF16 version.
0 HF download
Some notes on HF download
- Installation: `curl -LsSf https://hf.co/cli/install.sh | bash`
- Dataset download: `hf download --repo-type dataset org/dataset --local-dir dataset_local_folder`
- Model download: `hf download --repo-type model org/model --local-dir model_local_folder`
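For this particular build, the concrete invocations look roughly like the following; the repo ids and local directory names are assumptions, and `HF_TOKEN` is needed for the Nemotron dataset used later:

```bash
# Assumed repo ids and local dirs; adjust to your layout.
export HF_TOKEN=...   # needed to retrieve nvidia/Nemotron-Post-Training-Dataset-v2

hf download --repo-type model Qwen/Qwen3-235B-A22B --local-dir Qwen3-235B-A22B
hf download --repo-type dataset nvidia/Nemotron-Post-Training-Dataset-v2 \
    --local-dir Nemotron-Post-Training-Dataset-v2
```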
1 Engine building w ModelOpt
Following these instructions:
```bash
# Clone the Model Optimizer (ModelOpt)
git clone https://github.com/NVIDIA/Model-Optimizer.git
pushd Model-Optimizer

# Install ModelOpt
pip install -e .

# Quantize the Qwen3-235B-A22B model to nvfp4.
# By default, the checkpoint is stored in
# `Model-Optimizer/examples/llm_ptq/saved_models_Qwen3-235B-A22B_nvfp4_hf/`.
./examples/llm_ptq/scripts/huggingface_example.sh --model Qwen3-235B-A22B/ --quant nvfp4
```
- `--export_fmt hf` is no longer supported.
- Need `HF_TOKEN` set to retrieve `nvidia/Nemotron-Post-Training-Dataset-v2`.
- The build works inside the TRTLLM container when launched with the startup command below; a variant with the model directory mounted is sketched after the block. A bare `docker run` without these flags leads to NCCL errors.

```bash
docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    nvcr.io/nvidia/tensorrt-llm/release:latest
```
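To actually run the quantization inside the container, the host model directory has to be visible in it. A minimal sketch, assuming the checkpoint and the ModelOpt checkout live under `/raid/models` on the host (the mount path and the in-container paths are assumptions):

```bash
# Same flags as above, plus a volume mount and the HF token
# (assumes HF_TOKEN is exported on the host).
docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -v /raid/models:/workspace/models \
    -e HF_TOKEN=${HF_TOKEN} \
    nvcr.io/nvidia/tensorrt-llm/release:latest

# Then, inside the container, run the ModelOpt quantization from section 1:
#   cd /workspace/models/Model-Optimizer
#   ./examples/llm_ptq/scripts/huggingface_example.sh \
#       --model /workspace/models/Qwen3-235B-A22B/ --quant nvfp4
```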
2 Model Hosting with trtllm-serve

Using `trtllm-serve` creates an OpenAI-compatible endpoint:

```bash
#!/bin/bash
model_path=/path/to/llama3.1_70B
extra_llm_api_file=/tmp/extra-llm-api-config.yml

cat << EOF > ${extra_llm_api_file}
enable_attention_dp: false
print_iter_log: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

trtllm-serve ${model_path} \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 1024 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${extra_llm_api_file}
```
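Once the server is up, the endpoint can be smoke-tested with a standard OpenAI-style request. A minimal sketch, assuming the default port 8000 and that the served model name matches the path passed to `trtllm-serve`:

```bash
# "model" must match the name the server registered
# (here assumed to be the model path passed to trtllm-serve).
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/path/to/llama3.1_70B",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 32
        }'
```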
3 Benchmark with trtllm-bench
This method is currently NOT working:

`trtllm-bench --model <hf_model_name> --model_path <local_model_dir>`

- `--model` is required.
- `--model_path` does not work with the local model cache.
4 Benchmark with serve script
The current workaround is as follows:
There is a benchmark script `benchmark_serving.py` in the TRTLLM repo.
- Pre-download `ShareGPT` from HF as the random dataset.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/llama3.1_output
model_path=/path/to/llama3.1_70B

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_path} \
        --backend openai \
        --dataset-name "random" \
        --download-path "/raid/models/" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
```
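Each concurrency level writes one JSON file into `${result_dir}`. A small post-processing sketch to eyeball the sweep; the field names (`request_throughput`, `mean_ttft_ms`, `mean_tpot_ms`, `mean_e2el_ms`) are assumptions based on the vLLM-style `benchmark_serving` output, so check what your version actually writes:

```bash
result_dir=/tmp/llama3.1_output
for f in "${result_dir}"/concurrency_*.json; do
    echo "== ${f} =="
    # Keys below are assumed; missing ones simply print as null.
    jq -r '{request_throughput, mean_ttft_ms, mean_tpot_ms, mean_e2el_ms}
           | to_entries[] | "\(.key): \(.value)"' "${f}"
done
```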