What is the best quant for 2x RTX 4090s + 256GB RAM at 128K context?
Hi folks. I'm looking for advice on the best possible quant I can run on 2x RTX 4090s + 256GB RAM at 128K context.
I can run UD-Q2_K_XL @ 128K context with -ot ".ffn_.*_exps.=CPU", but I would like to offload fewer layers to CPU to speed up prefill and generation. Right now I'm getting about 10-14 tokens/s generation (and ~400-600 tokens/s prefill).
Here's what I'm running right now:
./llama.cpp/llama-server \
--model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--alias glm-4.6 \
--no-webui \
--threads 32 \
--ctx-size 128000 \
--no-context-shift \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--flash-attn on \
--batch-size 8192 \
--ubatch-size 8192 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 1.0 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 40 \
--jinja \
--no-mmap \
--cache-reuse 256 \
--prio 2 \
--parallel 1 \
--host 0.0.0.0 \
--port 3001
10-14 tokens/s is too slow to be practical. I am willing to take a small hit on quality for faster generation. Does anyone have any recommendations?
@juanquivilla Remove KV cache quantization for now - it uses more VRAM and is also slower - I need to report this to llama.cpp.
To offload more layers, see https://docs.unsloth.ai/models/deepseek-v3.1-how-to-run-locally#improving-generation-speed
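For example, instead of the blanket regex you can keep the experts of the first few layers on GPU and offload only the rest. A rough sketch (the layer split below is just an illustration; widen or narrow the range until your VRAM is nearly full):

-ot "blk\.(1[5-9]|[2-9][0-9])\.ffn_.*_exps\.=CPU" \

This keeps the experts of layers 0-14 in VRAM, so those layers no longer run their expert matmuls on the CPU, which is typically where most of the generation time goes.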
Thanks a ton Daniel.
Interestingly enough, the 4-bit KV quant actually uses a bit less VRAM for me. I did have to drop the batch and ubatch sizes to 4096 after removing KV cache quantization in order to run the model, and I get 1-2 tokens per second faster generation! But prefill is about 20% slower with the smaller batch/ubatch sizes.
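For reference, these are the only changes I made to the command above:

# removed: --cache-type-k q4_0 and --cache-type-v q4_0 (KV cache back to the default f16)
--batch-size 4096 \
--ubatch-size 4096 \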
I compiled the latest version of llama.cpp today, for whatever that's worth.
Thanks again for your advice and the fantastic unsloth work!
You can use binary search to offload as many layers as possible. I've posted my llama.cpp scripts here: https://github.com/ollama/ollama/issues/9957.
For example, on my system with 128 GB RAM and 16 GB VRAM, only the IQ2_XXS and lower quants fit.
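Roughly, the search looks like this (a minimal sketch of the idea, not the exact script from the issue above - the model path and test flags are placeholders you'd swap for your own):

#!/usr/bin/env bash
# Binary-search the largest --n-gpu-layers value that still loads without running out of VRAM.
# MODEL is a placeholder; use your real GGUF path and add whatever other flags you normally run with.
MODEL=path/to/model.gguf
lo=0; hi=99; best=0
while [ "$lo" -le "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  # Short test run: if llama-bench loads the model and finishes, this layer count fits.
  if ./llama.cpp/llama-bench -m "$MODEL" -ngl "$mid" -p 128 -n 16 >/dev/null 2>&1; then
    best=$mid
    lo=$(( mid + 1 ))
  else
    hi=$(( mid - 1 ))
  fi
done
echo "Maximum --n-gpu-layers that fits: $best"

Once you know the number, pass it to llama-server and let the remaining layers sit in RAM.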
That's fantastic, thanks for pointing me towards that, Andrey. I'll give it a whirl tonight for my use case. This is exactly what I was looking for.