Could you please upload a 99GB-100GB version of the MLX quantization model so that it can be deployed locally on a 128GB RAM MAC? Thank you very much!

#2
by mimeng1990 - opened

Sure thing. Does that mean the model is still a good coder? I never got to evaluate its performance; that's why I only uploaded a small quant.

First, I would like to express my respect and gratitude. The model at https://model.lmstudio.ai/download/nightmedia/LIMI-Air-qx86-hi-mlx is 97GB and works very well on a Mac with 128GB of RAM. However, I would also like to try a debugged REAP model. If the quant size were between 97GB and 100GB, a local deployment would be more directly comparable. 4-bit quantization is too small and clearly causes a significant performance loss; I've tested it, and the text-processing results are not ideal. 6-bit quantization is likewise not ideal. In my opinion, LIMI-Air-qx86-hi-mlx is currently the most suitable model for processing Chinese text on a Mac with 128GB of RAM; its output seems better than Qwen3-80B's.

I am surprised; I expected the 80B to be better at coding. Did you try one of the qx86n-hi quants I made of it?

I am uploading now the GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx

This is the equivalent of LIMI-Air-qx86-hi-mlx with a few corrections to the quant formula, so it should perform slightly better given the trimmed-down experts. One of the issues with GLM was noise in the inference, which made the hi quants less effective; in qx86g I enhanced some paths that were losing ground. There was no room in the previous model to add these and still fit in 128GB.

Let me know how it performs. After REAP the model is 74.89 GB, leaving you sufficient room for context. It is also faster than a comparable q8 quant, and significantly better than the smaller one.
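As a sanity check, the sizes quoted in this thread are consistent with file size scaling roughly linearly in the average bits per weight. A minimal sketch (the helper function and the ~106B base-parameter count for the unpruned model are my assumptions, not stated in the thread):

```python
# Rough quant-size estimate, assuming file size scales linearly with the
# average bits per weight. The ~106B base-parameter count is an assumption.
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

# A 97 GB qx86 quant of a ~106B-parameter base implies roughly
# 97 * 8 / 106 ~= 7.3 effective bits per weight.
bits = 97 * 8 / 106
print(round(bits, 1))

# At the same effective bit width, the 82B REAP prune should land near
# the reported 74.89 GB.
print(round(quant_size_gb(82, bits), 1))
```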

Thank you very much. I will download and try GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx as soon as it is available, then compare it with LIMI-Air-qx86-hi-mlx on the same piece of content to see which output is better. Will GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx suffer a performance loss if it is only 74.89 GB? I prefer to download a model between 93 GB and 99.9 GB to fully utilize my 128 GB RAM Mac. A 128 GB RAM Mac has 107.52 GB of VRAM available, so a quantized model of around 93 GB is a very good fit.
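The fit arithmetic here can be sketched as follows. The 107.52 GB figure is the VRAM limit quoted above (84% of 128 GB of unified memory); the context-headroom value is an assumed placeholder, since actual KV-cache usage depends on the model and context length:

```python
# Check whether a quantized model fits in the GPU-addressable share of
# unified memory. 107.52 GB is the VRAM limit quoted in the thread
# (84% of 128 GB); the context headroom is an assumed placeholder.
def fits(model_gb: float, vram_gb: float = 107.52,
         context_headroom_gb: float = 8.0) -> bool:
    return model_gb + context_headroom_gb <= vram_gb

print(fits(74.89))  # the 74.89 GB REAP quant leaves ample room for context
print(fits(97.0))   # the 97 GB LIMI-Air quant still fits, but more tightly
```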

Try first this quant, and if it still comes short I will upload the q8-hi, that is guaranteed to fill your RAM :)

I would really not expect the pruned model to function better; that comes down to the quantization. The differences you are asking about between 6-bit and 8-bit quants are not significant, but the pruning is.

Completely agree with that; the prune did reduce functionality and abilities. The question is whether by 25% or more :)

First, I would like to express my deepest respect and gratitude. I have downloaded many Nightmedia models; the large number of quants Nightmedia provides has made my various experiments much easier. I downloaded and tested GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx, processing a Chinese document of 18,000 tokens. The results were not as good as Qwen3-80B or nightmedia/LIMI-Air-qx86-hi-mlx. In actual testing, GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx was indeed much faster than LIMI-Air-qx86-hi-mlx, but in speed and text accuracy it still did not match Qwen3-80B and LIMI-Air-qx86-hi-mlx. Qwen3-80B sometimes produced output that was too simple, as if it gave a result without proper thought, while LIMI-Air-qx86-hi-mlx was just average. I think it is unnecessary to continue quantizing GLM-4.5-Air-REAP-82B-A12B to a 93 GB model; it is probably not suitable for me. Others with specific requirements can evaluate it based on their own needs. Thanks again.
