vLLM parameters


This post clarifies the meanings of some commonly confused vLLM parameters.

1. max_model_len (Configuration Parameter)

  • This is a configuration setting defined in ModelConfig that specifies the maximum allowed context length (prompt + output) for the model.
    # model.py, lines 178-186
    max_model_len: SkipValidation[int] = None  # type: ignore
    """Model context length (prompt and output). If unspecified, will be
    automatically derived from the model config.

    When passing via `--max-model-len`, supports k/m/g/K/M/G in human-readable
    format. Examples:
    - 1k -> 1000
    - 1K -> 1024
    - 25.6k -> 25,600"""
    
  • Set via the --max-model-len CLI argument, or derived from the model’s HuggingFace config
  • Acts as an upper bound/limit on how long sequences can be
  • Used for memory allocation planning and validation
  • The default max_model_len is derived from the HuggingFace model config by checking these keys and taking the smallest value found:

| Model Type | Config Key |
|---|---|
| LLaMA, Mistral, most modern models | max_position_embeddings |
| GPT-2 | n_positions |
| MPT | max_seq_len |
| ChatGLM | seq_length |
| Command-R, Cohere | model_max_length |
| Whisper | max_target_positions |
  • Priority/Logic (see the sketch after this list):
    • Check all of the keys above → take the minimum value found across them
    • Sliding window: if sliding-window attention is disabled but the model defines a sliding window smaller than that value, the sliding window size becomes the limit
    • Tokenizer config: model_max_length from tokenizer_config.json is also considered (the minimum is taken)
    • Fallback: if none of the keys are found, defaults to 2048 with a warning
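
To make the selection logic concrete, here is a minimal sketch of the idea. This is not vLLM’s actual code (the real logic lives in ModelConfig and handles more cases, e.g. RoPE scaling and the sliding-window cap), just an illustration of the "check keys, take the minimum, fall back to 2048" behavior described above:

    # Minimal sketch of how the default max_model_len is derived (assumed simplification)
    POSSIBLE_KEYS = [
        "max_position_embeddings",  # LLaMA, Mistral, most modern models
        "n_positions",              # GPT-2
        "max_seq_len",              # MPT
        "seq_length",               # ChatGLM
        "model_max_length",         # Command-R, Cohere
        "max_target_positions",     # Whisper
    ]

    def derive_max_model_len(hf_config: dict, tokenizer_config: dict | None = None) -> int:
        candidates = [hf_config[key] for key in POSSIBLE_KEYS if key in hf_config]
        # model_max_length from tokenizer_config.json also participates in the min
        if tokenizer_config and "model_max_length" in tokenizer_config:
            candidates.append(tokenizer_config["model_max_length"])
        if not candidates:
            print("Warning: no length key found in the model config, defaulting to 2048")
            return 2048
        return min(candidates)

    # A config that defines several keys -> the smallest one wins
    print(derive_max_model_len({"max_position_embeddings": 8192},
                               {"model_max_length": 4096}))  # -> 4096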

2. max_seq_len (Runtime Value)

  • This is a runtime metadata value used in the attention backends that represents the actual longest sequence length in the current batch being processed.
    # utils.py, lines 77-80
    max_query_len: int
    """Longest query in batch"""
    max_seq_len: int
    """Longest context length in batch"""
    
  • Computed dynamically for each batch: max_seq_len = seq_lens.max()
  • Used by attention kernels to optimize memory access patterns
  • Changes with every batch depending on the actual sequences being processed

Key differences:
| Aspect | max_model_len | max_seq_len |
|---|---|---|
| Scope | Global config | Per-batch runtime |
| Purpose | Upper bound/limit | Actual max in current batch |
| When set | Server startup | Every inference step |
| Value | Fixed (e.g., 4096) | Dynamic (≤ max_model_len) |

In short: max_model_len is the ceiling you configure, while max_seq_len is the actual maximum sequence length in the current batch being processed (always ≤ max_model_len).
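
To illustrate the contrast, here is a toy example (the 4096 ceiling is just an assumed value): the ceiling never changes, while the per-batch maximum is recomputed on every step.

    import numpy as np

    MAX_MODEL_LEN = 4096  # configuration: fixed at server startup

    # Two consecutive batches with different sequence lengths
    for seq_lens in (np.array([150, 200, 300]), np.array([1024, 512])):
        max_seq_len = int(seq_lens.max())  # runtime: recomputed every inference step
        assert max_seq_len <= MAX_MODEL_LEN
        print(f"seq_lens={seq_lens.tolist()} -> max_seq_len={max_seq_len}")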

3. max_num_seqs (Configuration Parameter)

  • Controls the maximum number of requests/sequences that can be batched together in a single forward pass.
    # scheduler.py, lines 44-49
    max_num_seqs: int = Field(default=DEFAULT_MAX_NUM_SEQS, ge=1)
    """Maximum number of sequences to be processed in a single iteration.

    The default value here is mainly for convenience when testing.
    In real usage, this should be set in `EngineArgs.create_engine_config`.
    """
    
  • Set via the --max-num-seqs CLI argument; the default is 128
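
For reference, these limits can also be passed when constructing the offline LLM engine in Python; the model name and values below are purely illustrative, not tuning recommendations:

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        max_model_len=4096,           # context ceiling (prompt + output)
        max_num_seqs=128,             # max sequences per scheduling step
        max_num_batched_tokens=8192,  # max total tokens per scheduling step
    )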

4. max_num_batched_tokens (Configuration Parameter)

  • Controls the maximum total number of tokens (summed across all sequences) that can be processed in a single iteration
    # scheduler.py
    max_num_batched_tokens: int = Field(default=DEFAULT_MAX_NUM_BATCHED_TOKENS, ge=1)

  • Set via the --max-num-batched-tokens CLI argument

  • Lower bound: max_num_batched_tokens >= max_num_seqs. Each sequence needs at least 1 token per step, so you can’t have more sequences than tokens.
  • Upper bound: max_num_batched_tokens <= max_num_seqs * max_model_len. The theoretical maximum is reached when every sequence is at the maximum length. A quick sanity check of both relations follows below.
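
Using the illustrative values from this post (128 sequences, 4096 context, 8192 batched tokens), the two relations can be checked directly:

    max_num_seqs = 128
    max_model_len = 4096
    max_num_batched_tokens = 8192

    # Each scheduled sequence contributes at least one token per step.
    assert max_num_batched_tokens >= max_num_seqs
    # No batch can exceed every sequence sitting at the full context length.
    assert max_num_batched_tokens <= max_num_seqs * max_model_len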

5. Example batch

Seq 1: [tokens…] 150 tokens
Seq 2: [tokens…] 200 tokens
Seq 3: [tokens…] 300 tokens

Seq N: [tokens…]

Constraints:

  • N ≤ 128 (max_num_seqs)
  • Each seq ≤ 4096 tokens (max_model_len)
  • Total tokens ≤ 8192 (max_num_batched_tokens)
  • Actual longest sequence in the current batch = 300 (max_seq_len)
      # In gpu_model_runner.py
      max_seq_len = self.seq_lens.np[:num_reqs].max().item()
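
Tying the example together, a small sketch that mirrors the computation above for this batch (the config values are the same illustrative ones, 128 / 4096 / 8192):

    import numpy as np

    seq_lens = np.array([150, 200, 300])   # the example batch above
    num_reqs = len(seq_lens)

    max_seq_len = int(seq_lens[:num_reqs].max())  # mirrors the gpu_model_runner line
    total_tokens = int(seq_lens.sum())

    print(num_reqs)      # 3   <= max_num_seqs (128)
    print(max_seq_len)   # 300 <= max_model_len (4096)
    print(total_tokens)  # 650 <= max_num_batched_tokens (8192)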
    

