NVFP4 engine building
Got access to a B200 for the first time and worked on building an NVFP4 TRTLLM engine and benchmarking it against the original BF16 version.
0 HF download
Some notes on HF download
- Installation: `curl -LsSf https://hf.co/cli/install.sh | bash`
- Dataset download: `hf download --repo-type dataset org/dataset --local-dir dataset_local_folder`
- Model download: `hf download --repo-type model org/model --local-dir model_local_folder`
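For this particular build, the concrete invocations look roughly like the following; the repo ids and local directory names are assumptions, and `HF_TOKEN` is needed for the Nemotron dataset used later:

```bash
# Assumed repo ids and local dirs; adjust to your layout.
export HF_TOKEN=...   # needed to retrieve nvidia/Nemotron-Post-Training-Dataset-v2

hf download --repo-type model Qwen/Qwen3-235B-A22B --local-dir Qwen3-235B-A22B
hf download --repo-type dataset nvidia/Nemotron-Post-Training-Dataset-v2 \
    --local-dir Nemotron-Post-Training-Dataset-v2
```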
1 Engine building w ModelOpt
Following these instructions:
```bash
# Clone the Model Optimizer (ModelOpt)
git clone https://github.com/NVIDIA/Model-Optimizer.git
pushd Model-Optimizer

# Install ModelOpt
pip install -e .

# Quantize the Qwen3-235B-A22B model to nvfp4.
# By default, the checkpoint is stored in
# `Model-Optimizer/examples/llm_ptq/saved_models_Qwen3-235B-A22B_nvfp4_hf/`.
./examples/llm_ptq/scripts/huggingface_example.sh --model Qwen3-235B-A22B/ --quant nvfp4
```
- `--export_fmt hf` is no longer supported.
- Need `HF_TOKEN` set to retrieve `nvidia/Nemotron-Post-Training-Dataset-v2`.
- The build works inside the TRTLLM container when launched with the startup command below; a variant with the model directory mounted is sketched after the block. A bare `docker run` without these flags leads to NCCL errors.

```bash
docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    nvcr.io/nvidia/tensorrt-llm/release:latest
```
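To actually run the quantization inside the container, the host model directory has to be visible in it. A minimal sketch, assuming the checkpoint and the ModelOpt checkout live under `/raid/models` on the host (the mount path and the in-container paths are assumptions):

```bash
# Same flags as above, plus a volume mount and the HF token
# (assumes HF_TOKEN is exported on the host).
docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -v /raid/models:/workspace/models \
    -e HF_TOKEN=${HF_TOKEN} \
    nvcr.io/nvidia/tensorrt-llm/release:latest

# Then, inside the container, run the ModelOpt quantization from section 1:
#   cd /workspace/models/Model-Optimizer
#   ./examples/llm_ptq/scripts/huggingface_example.sh \
#       --model /workspace/models/Qwen3-235B-A22B/ --quant nvfp4
```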
2 Model Hosting with trtllm-serve

Using `trtllm-serve` creates an OpenAI-compatible endpoint:

```bash
#!/bin/bash
model_path=/path/to/llama3.1_70B
extra_llm_api_file=/tmp/extra-llm-api-config.yml

cat << EOF > ${extra_llm_api_file}
enable_attention_dp: false
print_iter_log: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

trtllm-serve ${model_path} \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 1024 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${extra_llm_api_file}
```
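Once the server is up, the endpoint can be smoke-tested with a standard OpenAI-style request. A minimal sketch, assuming the default port 8000 and that the served model name matches the path passed to `trtllm-serve`:

```bash
# "model" must match the name the server registered
# (here assumed to be the model path passed to trtllm-serve).
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/path/to/llama3.1_70B",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 32
        }'
```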
3 Benchmark with trtllm-bench
This method is currently NOT working:

`trtllm-bench --model <hf_model_name> --model_path <local_model_dir>`

- `--model` is required.
- `--model_path` does not work with the local model cache.
4 Benchmark with serve script
The current workaround is as follows:
There is a benchmark script `benchmark_serving.py` in the TRTLLM repo.
- Pre-download `ShareGPT` from HF as the random dataset.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/llama3.1_output
model_path=/path/to/llama3.1_70B

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_path} \
        --backend openai \
        --dataset-name "random" \
        --download-path "/raid/models/" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
```
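Each concurrency level writes one JSON file into `${result_dir}`. A small post-processing sketch to eyeball the sweep; the field names (`request_throughput`, `mean_ttft_ms`, `mean_tpot_ms`, `mean_e2el_ms`) are assumptions based on the vLLM-style `benchmark_serving` output, so check what your version actually writes:

```bash
result_dir=/tmp/llama3.1_output
for f in "${result_dir}"/concurrency_*.json; do
    echo "== ${f} =="
    # Keys below are assumed; missing ones simply print as null.
    jq -r '{request_throughput, mean_ttft_ms, mean_tpot_ms, mean_e2el_ms}
           | to_entries[] | "\(.key): \(.value)"' "${f}"
done
```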