NVFP4 engine building

Got access to a B200 for the first time and worked on building an NVFP4 TRTLLM engine and benchmarking it against the original BF16 version

0 HF download

Some notes on HF download

  1. Installation
    curl -LsSf https://hf.co/cli/install.sh | bash
    
  2. Dataset download
    hf download --repo-type dataset org/dataset --local-dir dataset_local_folder
    
  3. Model download
    hf download --repo-type model org/model --local-dir model_local_folder
    
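For this post, a concrete version of the downloads (assuming the Hugging Face repo ID Qwen/Qwen3-235B-A22B for the model, plus the nvidia/Nemotron-Post-Training-Dataset-v2 dataset that the quantization step pulls later, with HF_TOKEN exported for gated repos) looks like:

    # Token for gated/authenticated repos
    export HF_TOKEN=<your_token>
    # Model weights that get quantized in the next section
    hf download --repo-type model Qwen/Qwen3-235B-A22B --local-dir Qwen3-235B-A22B
    # Dataset that the ModelOpt script retrieves during quantization
    hf download --repo-type dataset nvidia/Nemotron-Post-Training-Dataset-v2 --local-dir Nemotron-Post-Training-Dataset-v2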

1 Engine building with ModelOpt

Following the ModelOpt llm_ptq example instructions:

# Clone the Model Optimizer (ModelOpt)
git clone https://github.com/NVIDIA/Model-Optimizer.git
pushd Model-Optimizer
# Install ModelOpt in editable mode
pip install -e .
# Quantize the Qwen3-235B-A22B model to NVFP4
# By default, the checkpoint would be stored in `Model-Optimizer/examples/llm_ptq/saved_models_Qwen3-235B-A22B_nvfp4_hf/`.
./examples/llm_ptq/scripts/huggingface_example.sh --model Qwen3-235B-A22B/ --quant nvfp4 
  1. --export_fmt hf is no longer supported
  2. HF_TOKEN needs to be set to retrieve nvidia/Nemotron-Post-Training-Dataset-v2
  3. The build works inside the TRTLLM container started with the command below; running docker run directly without these flags led to NCCL errors (a variant with volume mounts is sketched after this list)
    docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    nvcr.io/nvidia/tensorrt-llm/release:latest
    
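The startup command above has no volume mounts, so the downloaded model, the ModelOpt checkout, and the quantized checkpoint are not visible inside the container. A variant with mounts (host paths here are hypothetical, adjust to your layout) would be:

    # Same flags as above, plus HF_TOKEN and two host mounts (hypothetical paths)
    docker run --rm -it \
    --ipc host \
    --gpus all \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -e HF_TOKEN=${HF_TOKEN} \
    -v /raid/models:/raid/models \
    -v ${PWD}/Model-Optimizer:/workspace/Model-Optimizer \
    nvcr.io/nvidia/tensorrt-llm/release:latest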

2 Model Hosting with trtllm-serve

Using trtllm-serve creates an OpenAI-compatible endpoint:

    #! /bin/bash
    model_path=/path/to/llama3.1_70B
    extra_llm_api_file=/tmp/extra-llm-api-config.yml
    cat << EOF > ${extra_llm_api_file}
    enable_attention_dp: false
    print_iter_log: true
    cuda_graph_config:
      enable_padding: true
      max_batch_size: 1024
    kv_cache_config:
      dtype: fp8
    EOF
    trtllm-serve ${model_path} \
     --max_batch_size 1024 \
     --max_num_tokens 2048 \
     --max_seq_len 1024 \
     --kv_cache_free_gpu_memory_fraction 0.9 \
     --tp_size 1 \
     --ep_size 1 \
     --trust_remote_code \
     --extra_llm_api_options ${extra_llm_api_file}
    
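Once the server is up, a quick sanity check against the OpenAI-compatible endpoint looks like the request below (the model field is assumed to be the served model path; adjust host/port if you changed them):

    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "/path/to/llama3.1_70B",
            "prompt": "Hello, my name is",
            "max_tokens": 16
          }'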

3 Benchmark with trtllm-bench

This method is currently NOT working:
trtllm-bench --model --model_path

  1. --model is required
  2. --model_path does not work with the local model cache
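
For reference, a trtllm-bench invocation of this shape (subcommand and flags as documented for trtllm-bench; the dataset file here is hypothetical) would look roughly like:

    trtllm-bench \
      --model Qwen/Qwen3-235B-A22B \
      --model_path ./saved_models_Qwen3-235B-A22B_nvfp4_hf \
      throughput \
      --dataset /tmp/synthetic_dataset.jsonl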

4 Benchmark with serve script

The current workaround is as follows:
There is a benchmark script benchmark_serving.py in the TRTLLM repo.

  1. Pre-download ShareGPT from HF as the random dataset
    concurrency_list="1 2 4 8 16 32 64 128 256"
    multi_round=5
    isl=1024
    osl=1024
    result_dir=/tmp/llama3.1_output
    model_path=/path/to/llama3.1_70B
    for concurrency in ${concurrency_list}; do
     num_prompts=$((concurrency * multi_round))
     python -m tensorrt_llm.serve.scripts.benchmark_serving \
         --model ${model_path} \
         --backend openai \
         --dataset-name "random" \
         --download-path "/raid/models/" \
         --random-input-len ${isl} \
         --random-output-len ${osl} \
         --random-prefix-len 0 \
         --num-prompts ${num_prompts} \
         --max-concurrency ${concurrency} \
         --ignore-eos \
         --save-result \
         --result-dir "${result_dir}" \
         --result-filename "concurrency_${concurrency}.json" \
         --percentile-metrics "ttft,tpot,itl,e2el"
    done
    
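The result JSONs land in ${result_dir}; a quick way to skim the headline numbers per concurrency (key names assumed from the vLLM-style benchmark_serving output, so treat this as a sketch; requires jq) is:

    for f in /tmp/llama3.1_output/concurrency_*.json; do
      echo "== ${f}"
      # Pull a few summary fields; adjust the key names if your version differs
      jq '{request_throughput, output_throughput, mean_ttft_ms, mean_tpot_ms}' "${f}"
    done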
