where are your lower quants?
It would be a boon to the community if you released your smaller MLX quants (3.5-bit hitting 1.8 perplexity!) for those of us with smaller Macs to use!
Ok sure, just let me know how much RAM you have and I can target one for you. In the meantime you can always run the higher quants (albeit slower) with memory offloading.
I'm using an M2 Ultra with 192GB. I know the 3-bit will fit (about 155GB), but I'm wondering if 3.5 bpw will be too big?
Great point, you're probably at your upper limit. Have you tried the one uploaded by mrtoots yet?
https://huggingface.co/mrtoots/GLM-4.6-mlx-3Bit
I did, about 16 tps inference speed. I'm sure the perplexity difference between 3-bit and 3.5-bit is far larger, though. It wrote me a story about a parrot that orders a mariachi costume and sings La Cucaracha. Not sure about the actual agentic and coding abilities yet.
Nice one. According to the config.json on there, it uses group_size: 64 with bits: 3, which works out to 3.5 bits per weight once the per-group scales and biases are counted.
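For anyone curious where the 3.5 comes from, here's a rough back-of-envelope sketch. It assumes MLX-style affine quantization stores a float16 scale and a float16 bias per group of weights, and the ~355B total parameter count for GLM-4.6 is my assumption, not something confirmed in this thread:

```python
# Back-of-envelope: effective bits per weight for group quantization.
# Assumes each group of `group_size` weights carries one float16 scale
# and one float16 bias alongside the packed low-bit values.

def effective_bpw(bits: int, group_size: int, scale_bits: int = 16, bias_bits: int = 16) -> float:
    return bits + (scale_bits + bias_bits) / group_size

bpw = effective_bpw(bits=3, group_size=64)
print(bpw)  # 3.5

# Hypothetical size estimate -- the ~355B parameter count is assumed.
params = 355e9
weight_gb = params * bpw / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~155 GB, before KV cache and runtime overhead
```

That lines up with the ~155GB you're seeing, with whatever is left over going to the KV cache and macOS itself.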
The next step up in this case would be a mixed quant, but it might not be worth it since you mentioned the 3.5-bit version is running well for you.
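If you ever want to try a mixed quant yourself, here's a rough sketch of how I'd approach it with mlx_lm's convert. The quant_predicate hook, its (path, module, config) signature, and the upstream repo id are assumptions about the mlx_lm release you have installed, so double-check against your version before relying on it:

```python
# Sketch of a mixed quant: keep sensitive layers at higher precision,
# quantize the rest more aggressively. The quant_predicate hook and its
# exact signature are assumptions about your mlx_lm version.
from mlx_lm import convert

def mixed_3_4(path, module, config):
    # Keep embeddings and the output head at 4-bit, everything else at 3-bit.
    # The layer-name matching here is illustrative, not GLM-4.6 specific.
    if "embed" in path or "lm_head" in path:
        return {"bits": 4, "group_size": 64}
    return {"bits": 3, "group_size": 64}

convert(
    "zai-org/GLM-4.6",              # assumed upstream repo id
    mlx_path="glm-4.6-mixed-3-4",
    quantize=True,
    q_bits=3,
    q_group_size=64,
    quant_predicate=mixed_3_4,
)
```

The trade-off is a few extra GB of weights in exchange for less quality loss in the layers that tend to suffer most from low-bit quantization.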
Hey! I like your app and quants. I'm on an M4 Max with 128GB. I've never understood how to use memory offloading (memory mapping from disk) with MLX. Can you explain how to configure that?! Thank you!
Thanks! It's not something you can configure in MLX yet.