AWQ version

#8
by celsowm - opened

Please,
Release AWQ version

Thanks!

The AWQ quant tools do not support vision models yet AFAIK.

I tried the latest llm-compressor (AutoAWQ has been adopted by the vLLM project), but their newest example, which uses GPTQ as an alternative, failed for me due to OOM (even with 256 GB RAM - not VRAM):

https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/mistral3_example.py
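For context, the core of that example boils down to the usual llm-compressor oneshot flow with a GPTQModifier. Roughly the sketch below - the model class, the ignore patterns, the calibration dataset and the sample count are from memory / assumptions, not a verbatim copy of the script:

```python
# Rough sketch of the GPTQ path in llm-compressor's mistral3 example.
# Ignore patterns, calibration dataset and sample count are assumptions.
import torch
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# W4A16 GPTQ on the language model only; the vision tower and projector stay unquantized.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

# The multimodal examples calibrate with one sample per batch and a trivial collator.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

# GPTQ needs a calibration pass - this is the step that ran out of host RAM for me.
oneshot(
    model=model,
    dataset="flickr30k",            # assumed calibration set, as in the upstream vision examples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    data_collator=data_collator,
)

model.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-W4A16", save_compressed=True)
processor.save_pretrained("Mistral-Small-3.2-24B-Instruct-2506-W4A16")
```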

Support by the Mistral AI team on llm-compressor would be nice.

Well, the experimental script for creating an FP8 quant did work.

For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
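If anyone wants to reproduce it: the FP8 path is much simpler than GPTQ because the dynamic scheme needs no calibration data. Roughly like this (a sketch of the standard llm-compressor FP8_DYNAMIC flow, not the exact script I used; the ignore patterns are assumptions):

```python
# Minimal sketch of an FP8 quant (dynamic per-token activations, per-channel weights)
# with llm-compressor. Not the exact script; "ignore" entries are assumptions.
from transformers import AutoProcessor, Mistral3ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC needs no calibration data, so oneshot runs without a dataset.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Mistral-Small-3.2-24B-Instruct-2506-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```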

vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model len and the max image count, and using the fp8 KV cache...).

INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
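For a rough idea of what those knobs look like in the offline API (illustrative values, not a tuned config):

```python
# Illustrative vLLM (v0.9.x) setup matching the description above; the concrete values
# for max_model_len, the image limit and gpu_memory_utilization are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8",
    max_model_len=32768,               # reduced from the native context to fit the L40
    limit_mm_per_prompt={"image": 4},  # cap the number of images per request
    kv_cache_dtype="fp8",              # fp8 KV cache to stretch the remaining ~15 GiB
    gpu_memory_utilization=0.98,
)

out = llm.chat(
    [{"role": "user", "content": "Say hello."}],
    SamplingParams(max_tokens=32, temperature=0.15),
)
print(out[0].outputs[0].text)
```

The same settings map onto --max-model-len, --limit-mm-per-prompt, --kv-cache-dtype fp8 and --gpu-memory-utilization when starting it via vllm serve instead.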

Would you mind sharing your vLLM params?
