where are your lower quants?
It would be a boon to the community if you released your smaller MLX quants (3.5-bit hitting 1.8 perplexity!) for those of us with smaller Macs to use!
Ok sure, just let me know how much RAM you have and I can target one for you. In the meantime you can always run the higher quants (albeit slower) with memory offloading.
I'm using an M2 Ultra with 192GB. I know the 3-bit will fit (about 155GB), but I'm wondering if 3.5 bpw will be too big?
Great point, you're probably at your upper limit. Have you tried the one uploaded by mrtoots yet?
https://huggingface.co/mrtoots/GLM-4.6-mlx-3Bit
I did, about 16 tps inference speed. I'm sure the perplexity difference between 3-bit and 3.5-bit is far larger, though. It wrote me a story about a parrot that orders a mariachi costume and sings La Cucaracha. Not sure about the actual agentic and coding abilities yet.
Nice one. According to the config.json on there, it uses group_size: 64 with bits: 3, which works out to 3.5 bits per weight once the per-group scales and biases are counted.
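For anyone curious where the 3.5 comes from, here's a rough back-of-envelope sketch. It assumes MLX-style affine quantization stores a float16 scale and a float16 bias per group of weights, and the ~355B total parameter count for GLM-4.6 is my assumption, not something confirmed in this thread:

```python
# Back-of-envelope: effective bits per weight for group quantization.
# Assumes each group of `group_size` weights carries one float16 scale
# and one float16 bias alongside the packed low-bit values.

def effective_bpw(bits: int, group_size: int, scale_bits: int = 16, bias_bits: int = 16) -> float:
    return bits + (scale_bits + bias_bits) / group_size

bpw = effective_bpw(bits=3, group_size=64)
print(bpw)  # 3.5

# Hypothetical size estimate -- the ~355B parameter count is assumed.
params = 355e9
weight_gb = params * bpw / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~155 GB, before KV cache and runtime overhead
```

That lines up with the ~155GB you're seeing, with whatever is left over going to the KV cache and macOS itself.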
The next step up in this case would be a mixed quant, but it might not be worth it since you mentioned the 3.5-bit version is running well for you.
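If you ever want to try a mixed quant yourself, here's a rough sketch of how I'd approach it with mlx_lm's convert. The quant_predicate hook, its (path, module, config) signature, and the upstream repo id are assumptions about the mlx_lm release you have installed, so double-check against your version before relying on it:

```python
# Sketch of a mixed quant: keep sensitive layers at higher precision,
# quantize the rest more aggressively. The quant_predicate hook and its
# exact signature are assumptions about your mlx_lm version.
from mlx_lm import convert

def mixed_3_4(path, module, config):
    # Keep embeddings and the output head at 4-bit, everything else at 3-bit.
    # The layer-name matching here is illustrative, not GLM-4.6 specific.
    if "embed" in path or "lm_head" in path:
        return {"bits": 4, "group_size": 64}
    return {"bits": 3, "group_size": 64}

convert(
    "zai-org/GLM-4.6",              # assumed upstream repo id
    mlx_path="glm-4.6-mixed-3-4",
    quantize=True,
    q_bits=3,
    q_group_size=64,
    quant_predicate=mixed_3_4,
)
```

The trade-off is a few extra GB of weights in exchange for less quality loss in the layers that tend to suffer most from low-bit quantization.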
Hey! I like your app and quants. I'm on an M4 Max with 128GB. I've never understood how to use memory offloading (memory mapping from disk) with MLX. Can you explain how to configure that?! Thank you!
Thanks! It's not something you can configure in MLX yet.