Error tensor 'blk.92.nextn.embed_tokens.weight' not found
Sadly I can't get this to load with ik_llama.cpp
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.92.nextn.embed_tokens.weight' not found
Ah, I got the same error trying another quant with llama.cpp; I just had to update it. Looks like ik_llama.cpp just needs an update too.
Yep, there's already an issue: https://github.com/ikawrakow/ik_llama.cpp/issues/812
EDIT: I think I can fix it.
Temp fix here, until the PR is merged: https://github.com/Downtown-Case/ik_llama.cpp
Works on my machine :P
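If you want to try it before the PR is merged, building the fork is just a standard ik_llama.cpp CMake build. Rough sketch, assuming the usual CUDA flags (check the repo's build docs if anything differs on your setup):

git clone https://github.com/Downtown-Case/ik_llama.cpp
cd ik_llama.cpp
# standard CUDA build; drop -DGGML_CUDA=ON for a CPU-only build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j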
Hi guys, could you advise what the recipe would be to run this with 2 or 4 3090s? (I have 4, but ideally I want to keep 2 free for other models.) I also have 512GB of DDR4 RAM, so perhaps it's possible to run larger quants at the same speed? What would you advise?
I guess it depends what kind of speeds you want? And what your max context size is.
A 'sweet spot' for your config, to me, would be something like IQ4_KS_R4 CPU FFN layers (relatively fast, not too lossy) and either IQ4_KT or IQ3_KT GPU layers, depending on whether you prefer speed (by cramming more layers onto the 3090s) or quality (as IQ4_KT GPU layers are quite low-loss).
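To make that concrete, a launch for that kind of mix on 2x 3090s would look roughly like this. Just a sketch: the paths and GGUF filename are hypothetical, and the --n-cpu-moe, --tensor-split and context numbers are placeholders you'd tune until both cards are nearly full.

# hypothetical filename; raise/lower --n-cpu-moe until VRAM is nearly full on both cards
/path/to/ik_llama.cpp/build/bin/llama-server \
  -m /path/to/GLM-4.6-IQ4_KS_R4-mix.gguf \
  --n-gpu-layers 999 --n-cpu-moe 80 --tensor-split 50,50 \
  -fa -fmoe -ub 2048 -b 2048 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 131072 --threads 8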
Thanks for the quick response!
Yeah, so I'm looking for a high-quality coder to be used alongside my faster coder (Qwen3-Coder-30B) in an agentic loop, to act sort of like a mentor that guides and oversees the little guy. The way I figure it, the fast coder can get much closer to larger models in quality if a larger model gives it suggestions beforehand (pre-prompt) or afterwards (as QA in an agentic loop).
So in this case, requirements are:
- The larger model would not have to be as accurate, as it's not going to write the code itself, perhaps only pseudo-code as instructions for the faster little guy.
- It would need to be able to execute agentic tool calling reliably to call upon MCP servers, tool-chain calls, etc.
- It would need to process prompt tokens relatively fast, as it will get the full context (ideally 256K, but no less than 128K), so the faster the better I suppose, definitely in the 100s of t/s, not the 10s.
- It would not be expected to output much, so eval/inference token generation speed is not critical, but ideally it should be closer to 10 t/s.
Bottom line: speed is more important than quality in this particular case, because the big model is not expected to write code flawlessly so much as it is expected to reason about code much better, relative to a 30B model.
What do you think?
P.S. If I could run the full model GPU-only (96GB total) I would, as it would be plenty fast. I would have to source more GPUs for the other models, but I don't think there's a quant that goes low enough; even for my relatively relaxed use case, which permits lossy performance, it's likely going to be too degraded at Q1-Q2 to be of any practical use anyway. I'll probably have to figure out a way to get Q3+ working in some CPU/GPU offloading scenario instead.
Oh yeah, you are asking a lot there, particularly with 200K context. Prompt processing is not fast with this setup, and TBH I'd consider capping it at 128K, where the model will be stronger anyway (these models always get flaky near the edge of their trained context, especially when heavily quantized).
'Sparser' models like Air, GPT-OSS 120B or some of the other Chinese coding finetunes (like Ring) that have come out recently may be better for your use case, TBH. Or maybe even Qwen3 Next via vLLM?
I would have to source more GPUs for the other models but I don't think there's a quant that goes low enough
As an aside, is Qwen 30B taking up the other 3090s?
TabbyAPI + a dynamic ~4.6bpw exl3 quant (4bpw FFNs, like 5-6bpw attention, 8bpw lm_head, maybe quantized KV cache) will squeeze into a single 3090, run quickly, and still be extremely close to 8-bit weights. And doing what you ask would be much easier with 3x 3090s than 2.
@BiggestFox I've been using --n-cpu-moe and adjusting the tensor split numbers until I can fit as much into VRAM as possible. This is what I was playing with recently for 3x24GB GPUs:
/home/user/ik_llama.cpp/build/bin/llama-server --n-gpu-layers 999 --n-cpu-moe 68 -fa \
  --ctx-size 100000 --reasoning-format auto \
  -m /home/user/ai/models/GLM-4.6-IQ2_KL/GLM-4.6-IQ2_KL-BIG-00001-of-00008.gguf \
  --host 0.0.0.0 --port 6969 --api-key "redacted" \
  --temp 0.6 --top-p 0.95 --min-p 0.005 --top-k 20 --threads 8 \
  -ub 2048 -b 2048 --cache-type-k q8_0 --cache-type-v q8_0 \
  -fmoe --parallel 1 --mlock --no-mmap --tensor-split 59,22,13
Hmmm. You could do something similar and potentially alter the K/V cache quantization; V tends to be less sensitive, and there's a new IQ4_NL option I haven't played with. So maybe Q8_0/Q5_1 or Q5_1/IQ4_NL would squeeze in more context and layers?
Batch size 4096 also makes a huge difference for prompt processing with CPU offloading.
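Concretely, starting from your command above, the only changes would be the V-cache type and the batch flags, something like below (shown with q5_1 for the V cache; swap in iq4_nl if you want to try the new option, that's the part I haven't tested). The VRAM the lighter cache frees up might also let you drop --n-cpu-moe a touch.

# same launch, with a lighter V cache and bigger batches for faster prompt processing
/home/user/ik_llama.cpp/build/bin/llama-server --n-gpu-layers 999 --n-cpu-moe 68 -fa \
  --ctx-size 100000 --reasoning-format auto \
  -m /home/user/ai/models/GLM-4.6-IQ2_KL/GLM-4.6-IQ2_KL-BIG-00001-of-00008.gguf \
  --host 0.0.0.0 --port 6969 --api-key "redacted" \
  --temp 0.6 --top-p 0.95 --min-p 0.005 --top-k 20 --threads 8 \
  -ub 4096 -b 4096 \
  --cache-type-k q8_0 --cache-type-v q5_1 \
  -fmoe --parallel 1 --mlock --no-mmap --tensor-split 59,22,13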
Otherwise... going smaller may be the way? With mostly IQ2_KT and some IQ3_KT until like layer 70 or so (assuming you're currently offloading everything past 59 to CPU on 3x 3090s). The 'KT' quants aren't so catastrophic for the bpw, with the big catch being that they're slower on CPU.
Also, FYI, the fix has been pulled into mainline llama.cpp
P.S. If I could run the full model GPU-only (96GB total) I would, as it would be plenty fast. I would have to source more GPUs for the other models, but I don't think there's a quant that goes low enough; even for my relatively relaxed use case, which permits lossy performance, it's likely going to be too degraded at Q1-Q2 to be of any practical use anyway. I'll probably have to figure out a way to get Q3+ working in some CPU/GPU offloading scenario instead.
TL;DR, but you could fit an exl3 quant with 96GB if I'm not mistaken.
Let me tell you, these are great! And thanks for the quant explanations - using V4 on my 3090 + 128GB RAM with great success!
Good! Glad it’s working!
FYI, experimentation hasn't stalled; I will be AFK for ~2 weeks, and I intend to follow up on the per-layer tests then. But I will still be around to reply on HF.
Thanks for this (V4 is great) and the KV quant tests.