AWQ version
Please release an AWQ version. Thanks!
The AWQ quant tools do not support vision models yet AFAIK.
I tried the latest llm-compressor (since AutoAWQ has been adopted by the vLLM project), but their newest GPTQ example, as an alternative, failed for me due to OOM (even with 256 GB RAM, not VRAM).
Support by the Mistral AI team on llm-compressor would be nice.
Well, the experimental script for creating a FP8 quant did work.
For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
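For reference, a data-free FP8 (dynamic) quant with llm-compressor roughly follows the pattern below. This is only a minimal sketch based on the public llm-compressor FP8 examples, not the exact experimental script mentioned above; the model class and the ignore patterns for the vision parts are assumptions and may need adjusting.

```python
# Minimal sketch of a data-free FP8-Dynamic quant with llm-compressor.
# Assumptions: a recent transformers release with Mistral3 support and a
# recent llm-compressor; the ignore patterns for the vision parts are a guess.
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linear layers; keep lm_head and the
# vision tower / projector in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```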
vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model len and the max image count, and using an fp8 KV cache...).
INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
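Not the exact command line I used, but the settings described above map onto vLLM engine arguments roughly as in the sketch below; the concrete values for the context length and the image limit are placeholders, not necessarily the ones from the run that produced the log output.

```python
# Rough sketch of the vLLM settings described above (v0.9.x, single L40).
# The values below are assumptions for illustration, not the exact parameters
# behind the memory-profiling log shown in this post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8",
    max_model_len=32768,               # reduced from the default context length
    limit_mm_per_prompt={"image": 4},  # cap the number of images per prompt
    kv_cache_dtype="fp8",              # fp8 KV cache to save VRAM
    gpu_memory_utilization=0.98,       # matches the 0.98 shown in the log above
)

out = llm.generate(
    "Describe the Mistral Small 3.2 release in one sentence.",
    SamplingParams(max_tokens=64, temperature=0.15),
)
print(out[0].outputs[0].text)
```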
Would you mind sharing your vLLM params?