What is the best quant for 2x RTX 4090s + 256GB RAM at 128K context?
Hi folks. I'm looking for advice on the best possible quant I can run on 2x RTX 4090s + 256GB RAM at 128K context.
I can run UD-Q2_K_XL @ 128K context with -ot ".ffn_.*_exps.=CPU", but I would like to offload fewer layers to CPU to speed up prefill and generation. Right now I'm getting about 10-14 tokens/s generation (and ~400-600 tokens/s prefill).
Here's what I'm running right now:
./llama.cpp/llama-server \
--model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--alias glm-4.6 \
--no-webui \
--threads 32 \
--ctx-size 128000 \
--no-context-shift \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--flash-attn on \
--batch-size 8192 \
--ubatch-size 8192 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 1.0 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 40 \
--jinja \
--no-mmap \
--cache-reuse 256 \
--prio 2 \
--parallel 1 \
--host 0.0.0.0 \
--port 3001
10-14 tokens/s is too slow to be practical. I am willing to take a small hit on quality for faster generation. Does anyone have any recommendations?
@juanquivilla Remove KV cache quantization for now - it uses more VRAM and is also slower - I need to report this to llama.cpp.
To offload more layers, see https://docs.unsloth.ai/models/deepseek-v3.1-how-to-run-locally#improving-generation-speed
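For example, instead of the blanket regex you can keep the experts of the first few layers on GPU and offload only the rest. A rough sketch (the layer split below is just an illustration; widen or narrow the range until your VRAM is nearly full):

-ot "blk\.(1[5-9]|[2-9][0-9])\.ffn_.*_exps\.=CPU" \

This keeps the experts of layers 0-14 in VRAM, so those layers no longer run their expert matmuls on the CPU, which is typically where most of the generation time goes.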
Thanks a ton Daniel.
Interestingly enough, the 4-bit KV quant actually uses a bit less VRAM for me. I did have to drop the batch and ubatch sizes to 4096 after removing KV cache quantization in order to run the model, and I get 1-2 tokens per second faster generation! But prefill is about 20% slower with the smaller batch/ubatch sizes.
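For reference, these are the only changes I made to the command above:

# removed: --cache-type-k q4_0 and --cache-type-v q4_0 (KV cache back to the default f16)
--batch-size 4096 \
--ubatch-size 4096 \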
I compiled the latest version of llama.cpp today, for whatever that's worth.
Thanks again for your advice and the fantastic unsloth work!
You can use binary search to offload as many layers as possible. I've posted my llama.cpp scripts here: https://github.com/ollama/ollama/issues/9957.
For example, on my system with 128 GB RAM and 16 GB VRAM, only the IQ2_XXS and lower quants fit.
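Roughly, the search looks like this (a minimal sketch of the idea, not the exact script from the issue above - the model path and test flags are placeholders you'd swap for your own):

#!/usr/bin/env bash
# Binary-search the largest --n-gpu-layers value that still loads without running out of VRAM.
# MODEL is a placeholder; use your real GGUF path and add whatever other flags you normally run with.
MODEL=path/to/model.gguf
lo=0; hi=99; best=0
while [ "$lo" -le "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  # Short test run: if llama-bench loads the model and finishes, this layer count fits.
  if ./llama.cpp/llama-bench -m "$MODEL" -ngl "$mid" -p 128 -n 16 >/dev/null 2>&1; then
    best=$mid
    lo=$(( mid + 1 ))
  else
    hi=$(( mid - 1 ))
  fi
done
echo "Maximum --n-gpu-layers that fits: $best"

Once you know the number, pass it to llama-server and let the remaining layers sit in RAM.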
That's fantastic, thanks for pointing me towards that, Andrey. I'll give it a whirl tonight for my use case. This is exactly what I was looking for.