Definitely does NOT run on dual RTX PRO 6000 96GB
I can confirm the model won't load on a dual RTX PRO 6000 rig, which is a real bummer. I'm still hoping for an actual 'Air' model that will fit on my setup. GLM 4.5 Air is a kickass model, but 4.6 is the more competent coder. I can load the GLM 4.6 Q2 quant, but it's not great. Thanks for posting this one though. I was hopeful, but it failed to run.
Huh? I can run this at 4.25 bpw.. it was about 128 GB. Are you loading full context? I have half your VRAM. On your system you should be able to do bnb-4bit straight from these weights themselves (rough sketch below).
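Something like this is what I mean, via transformers + bitsandbytes. The repo ID is a placeholder for whatever weights you're loading, and the NF4/bfloat16 settings are just generic defaults, not anything I've tested on this model:

```python
# Rough sketch, untested on this model: 4-bit load via bitsandbytes + transformers.
# "your/model-repo" is a placeholder; swap in the actual repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 usually beats plain FP4 at the same size
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # quantizes the quantization constants too
)

model_id = "your/model-repo"  # placeholder for the weights you're loading

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # needs accelerate; shards layers across both GPUs
)
```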
The context I tried was 128K, which is really the bare minimum, as it fills up very quickly when coding, and the "compact" step takes far too long anywhere near ~70K. Anything above ~70K context also gets extremely slow. It's always a struggle between losing context, running out of room, and hitting a performance wall. I was able to run the Cerebras 4.5 using bitsandbytes. I posted about it here:
https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B/discussions/11
With full ctx, it barely fits.
Yeah.. I mean, that's Air, a different class of model. With that much context you're stuck with GGUF/EXL3/AWQ and more aggressive quantization. Bare minimum for me is 32K. Attention over 128K is poor anyway, imo. You're using vLLM, so you can halve the KV cache to FP8; that could work for you (sketch below).
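For reference, roughly what I mean, assuming vLLM's offline API; the model path, context length, and tensor-parallel size are placeholders for your setup:

```python
# Rough sketch: vLLM with an FP8 KV cache to roughly halve cache memory.
# Model path, max_model_len, and tensor_parallel_size are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your/quantized-model",   # placeholder for your weights
    kv_cache_dtype="fp8",           # FP8 KV cache: ~half the memory of FP16/BF16
    max_model_len=32768,            # cap context; cache memory grows linearly with it
    tensor_parallel_size=2,         # split across the two cards
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

An FP8 cache stores each cached key/value in one byte instead of two, so the cache for a given context length takes roughly half the memory, which can be the difference between fitting 32K+ and not.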