Can you explain the half-bit quants?
Very cool app, but can you help me understand what your half-bit quants are doing? What’s the advantage of the 6.5bit over a regular 6bit? Are you quantizing the different layers at different quant levels? Very curious to test this out.
It's just the calculated average bits per weight.
This is how the function is computed in MLX:

import mlx.core as mx
from mlx.utils import tree_reduce

def compute_bits_per_weight(model):
    # Sum the bytes of every array in the model's parameter tree.
    model_bytes = tree_reduce(
        lambda acc, x: acc + x.nbytes if isinstance(x, mx.array) else acc, model, 0
    )
    # get_total_parameters is a helper that counts the model's parameters.
    model_params = get_total_parameters(model)
    return model_bytes * 8 / model_params
With affine quantization, each group of weights also stores a scale and a bias (typically fp16, i.e. 32 extra bits per group), so the average comes out above the nominal bit width. 6-bit with a group size of 64 would typically be computed as 6 + 32/64 = 6.5 bits per weight. 6-bit with a group size of 128 would be 6.252 bits per weight. Mixing layers, e.g. some at 4-bit and some at 6-bit, would get you around 4.87 bits (depending on which layers you select and the group size).
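As a quick sketch of that arithmetic (the function names here are mine, and it assumes an fp16 scale plus fp16 bias per group, i.e. 32 overhead bits; real models may also have non-quantized layers that shift the average slightly):

```python
def bits_per_weight(bits, group_size, overhead_bits=32):
    # Each weight costs `bits`, plus the group's scale and bias
    # (assumed fp16 each, 32 bits total) amortized over the group.
    return bits + overhead_bits / group_size

print(bits_per_weight(6, 64))   # 6.5
print(bits_per_weight(6, 128))  # 6.25

def mixed_bpw(layers):
    # layers: list of (num_params, bits, group_size) tuples.
    # The overall average is weighted by each layer's parameter count.
    total_bits = sum(n * bits_per_weight(b, g) for n, b, g in layers)
    total_params = sum(n for n, _, _ in layers)
    return total_bits / total_params
```

For example, a model split evenly between 4-bit and 6-bit layers at group size 64 averages out to 5.5 bits per weight.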
For further info on the quants, you can usually find a quantization block inside the model's config.json, e.g.
"quantization": {
"group_size": 64,
"bits": 6,
"mode": "affine"
},
Hope that helps.