Performance on Gaming Rig
I'm sad there is no GLM-4.6-Air (unlikely it will be released, but who knows), so instead I cooked the ubergarm/GLM-4.6-smol-IQ2_KS 97.990 GiB (2.359 BPW) quant, which is just a little bigger than a full Q8_0 Air, to run on my local gaming rig with 96GB RAM + 24GB VRAM.
It runs well up to about 32k context, or you can make some trade-offs for more PP at the cost of TG speed. Here is a llama-sweep-bench showing how quantizing the kv-cache gives a steeper drop-off in TG for this architecture, similar to GLM-4.5.
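If you want to try the same thing, the invocation looks roughly like the sketch below; it is not my exact command, and the model path, thread count, and context size are placeholders to adjust for your own rig:

```bash
# Sketch only: path, threads, and context are placeholders for a 96GB RAM + 24GB VRAM box.
# Routed experts stay on CPU via -ot exps=CPU; attention, shared experts, and the dense
# layers ride on the 24GB GPU with -ngl 99.
# Add -ctk q4_0 -ctv q4_0 to fit more context, at the cost of the steeper TG drop-off above.
./ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /path/to/GLM-4.6-smol-IQ2_KS.gguf \
    --ctx-size 32768 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16 \
    -b 4096 -ub 4096
```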
It's pretty amazing what good local models we can run on gaming rigs now! Have fun everyone!
Interesting. Honestly, even with the perplexity difference, I wonder if this model at an IQ1/IQ2 quant would perform the same as, if not better than, an Air variant at a higher quant with the same file size.
Also, theoretically the whole thinking process could help bring back some of the lost accuracy, just by how the thinking process helps refine an answer. (However, if the initial assumption it hardens on is wrong, it could just think itself into concluding the incorrect answer is correct... LOL)
I notice the same thing running Kimi locally: even though the quant is IQ2, I still prefer it to DeepSeek at IQ4 any day; it just has that much more information to reference inside of it. (If you haven't noticed from all of my Hugging Face comments, I'm a pro-Kimi guy, I just adore that model.)
For the next few weeks I've decided to keep this model, but at a lower quant to help speed up usability, as I don't really use it for coding things atm, so I'm fine with a little (big) loss of accuracy... besides, less accuracy could mean more creativity (to an extent)...
Full GPU offload w/ IQ1 of this model is still coherent for chatting / general tasks, with 27 t/s generation on a fresh conversation. The context speed falloff is real, with generation speeds dropping to 17 t/s (probably due to my q4 context quantization, but still, to be expected).
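For reference, the kind of command that setup corresponds to, sketched with llama.cpp/ik_llama.cpp-style flags; the model filename and context size are placeholders, not my literal invocation:

```bash
# Sketch: full GPU offload with a quantized (q4_0) KV cache, as described above.
# Model path and context size are placeholders.
./build/bin/llama-server \
    --model /path/to/GLM-4.6-IQ1_KT.gguf \
    -ngl 99 \
    -fa \
    -ctk q4_0 -ctv q4_0 \
    --ctx-size 32768
```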
> Interesting. Honestly, even with the perplexity difference, I wonder if this model at an IQ1/IQ2 quant would perform the same as, if not better than, an Air variant at a higher quant with the same file size.
Yeah myself and some folks are wondering the same question. I gave a half answer tonight on reddit: https://old.reddit.com/r/LocalLLaMA/comments/1nwimej/comment/nhgbj2s/
> theoretically the whole thinking process could help bring back some of the lost accuracy, just by how the thinking process helps refine an answer.
I've heard that said in other places too, how thinking can possibly kind of recover damage done by quantization. I'm not sure really, because, as you mention, what if the thinking process itself is also damaged, causing it to go confidently off the rails xD lol...
> I notice the same thing running Kimi locally: even though the quant is IQ2, I still prefer it to DeepSeek at IQ4 any day,
You're not alone there; I've heard some folks on the AI Beavers Discord say Kimi-K2 is their fav, but yeah, it is so huge... ain't nobody got enough GPUs lol...
> Full GPU offload w/ IQ1 of this model is still coherent for chatting / general tasks, with 27 t/s generation on a fresh conversation. The context speed falloff is real, with generation speeds dropping to 17 t/s (probably due to my q4 context quantization, but still, to be expected).
Awesome, thanks for the report, as the iq1_kt is probably best for full offload on an RTX 6000 PRO or anything with enough VRAM. I kept the attn/shexp/first 3 dense layers pretty big and only smashed the routed experts hard, so it's amazing it still works okay.
And yes, especially on GPU, f16 kv-cache is probably going to be fastest. You can use other types in ik_llama.cpp for kv-cache quantization, e.g. `-ctk q4_1 -ctv q4_1` might be reasonably better than q4_0. There is also q6_0, which is available and often overlooked. You can even go wild and use stuff like `-ctk iq4_kss -ctv iq4_kss`, but that's probably gonna drag down the speeds hah... Too many options!
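Any of those slot straight into the usual command; a sketch with placeholder paths, picking the often-overlooked q6_0:

```bash
# Sketch: same style of invocation as elsewhere in this thread, just swapping the cache types.
# Swap q6_0 for q4_1, iq4_kss, etc. to taste; leaving -ctk/-ctv off entirely keeps the f16 default.
./ik_llama.cpp/build/bin/llama-server \
    --model /path/to/GLM-4.6-smol-IQ2_KS.gguf \
    -ngl 99 -fa -fmoe \
    -ot exps=CPU \
    -ctk q6_0 -ctv q6_0 \
    --ctx-size 32768
```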
That perplexity on the IQ3_KS is lookin nice and...
IQ3_KS looks to be 160GB, leaving about 14GB for kv cache on a dual-3090 + 128GB RAM system.
... this.. might... be.... the best day ever.
Thanks for sharing the numbers. The amazing PP got me to do some research and come to the realization that PCIe speed absolutely matters for GPU offload during prompt processing. I posted about this at https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/
Wendell needs to add a disclaimer when he does this haha: https://www.youtube.com/watch?v=ziZDzrDI7AM&t=484s
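If you want to check what link your cards actually negotiated (risers and chipset-attached slots are the usual culprits), something like the query below should do it on NVIDIA; the fields are standard nvidia-smi ones, but double-check against your driver version:

```bash
# Report current vs. maximum PCIe generation and lane width per GPU.
# A card stuck at e.g. Gen1 x1 behind a riser shows up immediately here.
nvidia-smi \
    --query-gpu=name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max \
    --format=csv
```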
Hi,
thank you @ubergarm for the great quants!
I'll add my results here, since mine is still technically a gaming rig, even if pushed to its limits. I'm running a Ryzen 9900X on an Asus X870E Creator with 192GB @ 6000 MT/s and 96GB VRAM (32+24+24+16), on Xubuntu to squeeze every last megabyte out of the video cards.
First is IQ4_K at 32k context with KV at fp16; it's quite usable for conversation, but not really for coding given the slow TG.
```bash
./ik_llama.cpp/build/bin/llama-sweep-bench \
--model /home/llm_models/GLM-4.6/IQ4_K/GLM-4.6-IQ4_K-00001-of-00005.gguf \
--ctx-size 32768 \
-fa -fmoe \
-ngl 99 \
-ot "blk\.[0-9]\.ffn.*=CUDA0" \
-ot "blk\.1[0]\.ffn.*=CUDA0" \
-ot "blk\.1[1-4]\.ffn.*=CUDA1" \
-ot "blk\.1[5-9]\.ffn.*=CUDA2" \
-ot "blk\.2[0]\.ffn.*=CUDA2" \
-ot "blk\.2[1-6]\.ffn.*=CUDA3" \
-ot "blk\\..*_exps\\.=CPU" \
--no-mmap \
--threads 11 \
--parallel 1 \
-b 4096 -ub 4096
```
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 11, n_threads_batch = 11
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 15.018 | 272.74 | 203.825 | 5.02 |
| 4096 | 1024 | 4096 | 15.041 | 272.33 | 208.365 | 4.91 |
| 4096 | 1024 | 8192 | 14.203 | 288.40 | 204.167 | 5.02 |
| 4096 | 1024 | 12288 | 14.348 | 285.48 | 199.351 | 5.14 |
| 4096 | 1024 | 16384 | 14.926 | 274.42 | 204.137 | 5.02 |
| 4096 | 1024 | 20480 | 15.996 | 256.07 | 210.028 | 4.88 |
| 4096 | 1024 | 24576 | 16.961 | 241.49 | 215.704 | 4.75 |
| 4096 | 1024 | 28672 | 18.034 | 227.13 | 220.709 | 4.64 |
Then smol-IQ4_KSS. This one is pretty impressive: I can run it at 65k context with fp16 KV cache, and it's faster than the IQ4_K. I've noticed quantizing the KV cache on this model degrades the output, so I keep it at fp16 regardless of the speed.
```bash
./ik_llama.cpp/build/bin/llama-sweep-bench \
--model /home/llm_models/GLM-4.6/smol-IQ4_KSS/GLM-4.6-smol-IQ4_KSS-00001-of-00004.gguf \
--ctx-size 65536 \
-fa -fmoe \
-ngl 99 \
-ot "blk\.[0-9]\.ffn.*=CUDA0" \
-ot "blk\.1[0]\.ffn.*=CUDA0" \
-ot "blk\.1[1-4]\.ffn.*=CUDA1" \
-ot "blk\.1[5-9]\.ffn.*=CUDA2" \
-ot "blk\.2[0]\.ffn.*=CUDA2" \
-ot "blk\.2[1-6]\.ffn.*=CUDA3" \
-ot "blk\\..*_exps\\.=CPU" \
--no-mmap \
--threads 11 \
--parallel 1 \
-b 4096 -ub 4096
```
main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 11, n_threads_batch = 11
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 12.288 | 333.32 | 156.216 | 6.56 |
| 4096 | 1024 | 4096 | 12.361 | 331.37 | 159.770 | 6.41 |
| 4096 | 1024 | 8192 | 12.516 | 327.27 | 163.223 | 6.27 |
| 4096 | 1024 | 12288 | 12.925 | 316.90 | 167.688 | 6.11 |
| 4096 | 1024 | 16384 | 13.682 | 299.38 | 172.297 | 5.94 |
| 4096 | 1024 | 20480 | 14.648 | 279.62 | 177.650 | 5.76 |
| 4096 | 1024 | 24576 | 15.574 | 263.01 | 182.773 | 5.60 |
| 4096 | 1024 | 28672 | 16.550 | 247.50 | 187.176 | 5.47 |
| 4096 | 1024 | 32768 | 17.488 | 234.22 | 191.876 | 5.34 |
| 4096 | 1024 | 36864 | 18.404 | 222.56 | 196.346 | 5.22 |
| 4096 | 1024 | 40960 | 19.467 | 210.41 | 200.982 | 5.09 |
| 4096 | 1024 | 45056 | 20.525 | 199.56 | 205.287 | 4.99 |
| 4096 | 1024 | 49152 | 21.462 | 190.85 | 209.725 | 4.88 |
| 4096 | 1024 | 53248 | 22.304 | 183.65 | 214.356 | 4.78 |
| 4096 | 1024 | 57344 | 23.415 | 174.93 | 218.771 | 4.68 |
| 4096 | 1024 | 61440 | 24.869 | 164.70 | 227.546 | 4.50 |
> Then smol-IQ4_KSS. This one is pretty impressive: I can run it at 65k context with fp16 KV cache, and it's faster than the IQ4_K. I've noticed quantizing the KV cache on this model degrades the output, so I keep it at fp16 regardless of the speed.
Hey, appreciate the report, and you have a very capable rig! Yes, the smol-IQ4_KSS does not use full-size q8_0 for the GPU layers, so it can fit more in VRAM and tends to be faster. I generally do this with the "KS"/"KSS" quants and keep the "_K" quants at full q8_0 for people who like that.
-ot "blk\\..*_exps\\.=CPU"
This regex has two extra backslashes (at least as it is rendered on this website for me), but I guess it's actually working? You can simply use `-ot exps=CPU` and it will do the trick, a little simpler looking, at least on this model. Some models like gpt-oss-120b do require a more specific regex, as mentioned in a recent ik_llama.cpp PR about that one.
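So your second command could look like this with only that last line changed (a sketch, keeping your per-device splits exactly as they were):

```bash
# Same invocation as above; only the final catch-all override is simplified.
# "exps" matches the routed-expert tensors (ffn_*_exps); anything not already
# pinned to CUDA0-3 by the earlier -ot lines falls back to CPU.
./ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /home/llm_models/GLM-4.6/smol-IQ4_KSS/GLM-4.6-smol-IQ4_KSS-00001-of-00004.gguf \
    --ctx-size 65536 \
    -fa -fmoe \
    -ngl 99 \
    -ot "blk\.[0-9]\.ffn.*=CUDA0" \
    -ot "blk\.1[0]\.ffn.*=CUDA0" \
    -ot "blk\.1[1-4]\.ffn.*=CUDA1" \
    -ot "blk\.1[5-9]\.ffn.*=CUDA2" \
    -ot "blk\.2[0]\.ffn.*=CUDA2" \
    -ot "blk\.2[1-6]\.ffn.*=CUDA3" \
    -ot exps=CPU \
    --no-mmap \
    --threads 11 \
    --parallel 1 \
    -b 4096 -ub 4096
```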
And yes, AesSedai and I have also noted that using q8_0 for the kv-cache slows GLM-4.6 down noticeably for longer-context TG! Good observation and glad to hear it confirmed in the wild!
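If anyone wants to reproduce that, the cleanest way is to run the same sweep twice and compare the S_TG column at depth; a sketch with placeholder paths:

```bash
# Sketch: A/B the kv-cache type with otherwise identical settings.
# First pass: full-precision f16 cache (the default).
./ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /path/to/GLM-4.6-smol-IQ4_KSS-00001-of-00004.gguf \
    --ctx-size 32768 -fa -fmoe -ngl 99 -ot exps=CPU \
    -b 4096 -ub 4096

# Second pass: q8_0 cache; the S_TG gap versus the first run widens as N_KV grows.
./ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /path/to/GLM-4.6-smol-IQ4_KSS-00001-of-00004.gguf \
    --ctx-size 32768 -fa -fmoe -ngl 99 -ot exps=CPU \
    -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096
```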
I have a new https://huggingface.co/ubergarm/Ling-1T-GGUF coming out soon that will fit your setup too, if huggingface gets back to me about the public file size quota!
Have a great weekend!
> come to the realization that PCIe speed absolutely matters for GPU offload during prompt processing
So I don't have the right rig to test these advanced "offload policy" features of ik_llama.cpp, but basically you can specify which operations are allowed to offload or not. Folks with limited PCIe bandwidth have reported being able to improve TG by using different combinations, e.g. `llama-server ... -op 27,28,30,31`, and the right combination could help your rig out in some situations. Note the numbers changed recently, as pointed out by ik in the first linked thread below.
Some relevant reading if you'd like to research it more yourself:
- https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/discussions/17#68f496c3e6c7ab5a4f982669
- https://github.com/ikawrakow/ik_llama.cpp/issues/765#issuecomment-3263841372
Curious if you are able to get any meaningful boost one way or the other!

