Can you explain the half-bit quants?
Very cool app, but can you help me understand what your half-bit quants are doing? What’s the advantage of the 6.5bit over a regular 6bit? Are you quantizing the different layers at different quant levels? Very curious to test this out.
It's just the calculated average bits per weight.
This is how the function is computed in MLX:

import mlx.core as mx
from mlx.utils import tree_reduce

def compute_bits_per_weight(model):
    # Sum the bytes of every array in the model's parameter tree.
    model_bytes = tree_reduce(
        lambda acc, x: acc + x.nbytes if isinstance(x, mx.array) else acc, model, 0
    )
    # get_total_parameters is a helper that counts the model's parameters.
    model_params = get_total_parameters(model)
    return model_bytes * 8 / model_params
With affine quantization, each group of weights also stores a scale and a bias (typically fp16, i.e. 32 extra bits per group), so the average comes out above the nominal bit width. 6-bit with a group size of 64 would typically be computed as 6 + 32/64 = 6.5 bits per weight. 6-bit with a group size of 128 would be 6.252 bits per weight. Mixing layers, e.g. some at 4-bit and some at 6-bit, would get you around 4.87 bits (depending on which layers you select and the group size).
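As a quick sketch of that arithmetic (the function names here are mine, and it assumes an fp16 scale plus fp16 bias per group, i.e. 32 overhead bits; real models may also have non-quantized layers that shift the average slightly):

```python
def bits_per_weight(bits, group_size, overhead_bits=32):
    # Each weight costs `bits`, plus the group's scale and bias
    # (assumed fp16 each, 32 bits total) amortized over the group.
    return bits + overhead_bits / group_size

print(bits_per_weight(6, 64))   # 6.5
print(bits_per_weight(6, 128))  # 6.25

def mixed_bpw(layers):
    # layers: list of (num_params, bits, group_size) tuples.
    # The overall average is weighted by each layer's parameter count.
    total_bits = sum(n * bits_per_weight(b, g) for n, b, g in layers)
    total_params = sum(n for n, _, _ in layers)
    return total_bits / total_params
```

For example, a model split evenly between 4-bit and 6-bit layers at group size 64 averages out to 5.5 bits per weight.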
For further info on the quants, you can usually find a quantization block inside the model's config.json, e.g.
"quantization": {
"group_size": 64,
"bits": 6,
"mode": "affine"
},
Hope that helps.