Max model len is 32768 instead of 40960 when serving with vLLM

#19 opened by f14

When serving with vLLM as follows:

vllm serve \
    --host 0.0.0.0 \
    --port 8080 \
    --max-num-seqs 512 \
    --max-model-len 40960 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --tokenizer-mode mistral \
    --load-format mistral \
    --config-format mistral \
    mistralai/Magistral-Small-2506

vLLM fails during model loading with the following error:

Value error, User-specified max_model_len (40960) is greater than the derived max_model_len (max_position_embeddings=32768 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
  For further information visit https://errors.pydantic.dev/2.11/v/value_error

Is there a way to use the full 40960-token context length with vLLM?

You can override the max length check by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN to 1.
You can add it to your startup script like this (full example below):
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve [...]
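
For reference, a sketch of the question's original command with that override applied might look like this (assuming the same 4-GPU setup; adjust --tensor-parallel-size and --max-num-seqs to your hardware):

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve \
    --host 0.0.0.0 \
    --port 8080 \
    --max-num-seqs 512 \
    --max-model-len 40960 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --tokenizer-mode mistral \
    --load-format mistral \
    --config-format mistral \
    mistralai/Magistral-Small-2506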

Thanks! vLLM itself already suggests that solution. However, I'm wondering whether this is the intended way to run Magistral-Small with vLLM.
If so, it might be worth noting in the model card that VLLM_ALLOW_LONG_MAX_MODEL_LEN should be set.
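
One way to check what context length the server actually ends up using, assuming the /v1/models endpoint of vLLM's OpenAI-compatible server reports a max_model_len field (it does in recent versions, but worth verifying against yours):

# Query the running server and inspect the reported context window;
# after the override it should read 40960 rather than 32768.
curl -s http://localhost:8080/v1/models | python3 -m json.tool | grep max_model_len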
