Max model len is 32768 when serving with vllm and not 40960
#19 opened by f14
When serving with vLLM as follows:
vllm serve \
--host 0.0.0.0 \
--port 8080 \
--max-num-seqs 512 \
--max-model-len 40960 \
--dtype bfloat16 \
--tensor-parallel-size 4 \
--tokenizer-mode mistral \
--load-format mistral \
--config-format mistral \
mistralai/Magistral-Small-2506
vLLM fails during model loading with the following error:
Value error, User-specified max_model_len (40960) is greater than the derived max_model_len (max_position_embeddings=32768 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
Is there a way to use the full 40960 context length with vLLM?
You can override the max length check by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN to 1.
You can add this to your startup script like this:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve [...]
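For reference, putting the export together with the exact flags from your command above (no other changes assumed), the full startup script would look like this:

# Allow a user-specified max_model_len larger than the one derived from the model config
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve \
  --host 0.0.0.0 \
  --port 8080 \
  --max-num-seqs 512 \
  --max-model-len 40960 \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --tokenizer-mode mistral \
  --load-format mistral \
  --config-format mistral \
  mistralai/Magistral-Small-2506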
Thanks! vLLM itself already suggests that solution. However, I'm wondering whether this is the intended way to run Magistral-Small with vLLM.
If so, it might be worth noting in the model card that VLLM_ALLOW_LONG_MAX_MODEL_LEN should be set.
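For anyone checking that the override took effect: with the server from the command above running on port 8080, you can query the OpenAI-compatible /v1/models endpoint. Recent vLLM versions include a max_model_len field in the per-model entry, though whether the field is present may depend on your vLLM version.

# Query the running server (host/port taken from the serve command above)
curl -s http://localhost:8080/v1/models | python3 -m json.tool
# Look for "max_model_len" in the model entry; after the override it should
# report 40960 rather than the derived 32768.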