AWQ version
Please release an AWQ version. Thanks!
The AWQ quant tools do not support vision models yet AFAIK.
I tried the latest llm-compressor (since AutoAWQ has been adopted by the vLLM project), but their newest GPTQ example, as an alternative, failed for me due to OOM (even with 256 GB RAM, not VRAM).
Support by the Mistral AI team on llm-compressor would be nice.
Well, the experimental script for creating a FP8 quant did work.
For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
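For reference, a data-free FP8 (dynamic) quant with llm-compressor roughly follows the pattern below. This is only a minimal sketch based on the public llm-compressor FP8 examples, not the exact experimental script mentioned above; the model class and the ignore patterns for the vision parts are assumptions and may need adjusting.

```python
# Minimal sketch of a data-free FP8-Dynamic quant with llm-compressor.
# Assumptions: a recent transformers release with Mistral3 support and a
# recent llm-compressor; the ignore patterns for the vision parts are a guess.
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Quantize only the language-model Linear layers; keep lm_head and the
# vision tower / projector in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```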
vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model len and the max image count, and using an fp8 KV cache...).
INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
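Not the exact command line I used, but the settings described above map onto vLLM engine arguments roughly as in the sketch below; the concrete values for the context length and the image limit are placeholders, not necessarily the ones from the run that produced the log output.

```python
# Rough sketch of the vLLM settings described above (v0.9.x, single L40).
# The values below are assumptions for illustration, not the exact parameters
# behind the memory-profiling log shown in this post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8",
    max_model_len=32768,               # reduced from the default context length
    limit_mm_per_prompt={"image": 4},  # cap the number of images per prompt
    kv_cache_dtype="fp8",              # fp8 KV cache to save VRAM
    gpu_memory_utilization=0.98,       # matches the 0.98 shown in the log above
)

out = llm.generate(
    "Describe the Mistral Small 3.2 release in one sentence.",
    SamplingParams(max_tokens=64, temperature=0.15),
)
print(out[0].outputs[0].text)
```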
Would you mind sharing your vLLM params?