the faster the inference speed becomes. Why is that?

#9
by lsm03624 - opened

With the same hardware configuration and parameters, the inference speed of the IQ2_KS quantization files is approximately twice that of the IQ1_KT files. Moreover, the larger the files are, the faster the inference speed becomes. Why is that?

Because IQ1_KT is a "trellis" quant, similar to QTIP/EXL3, and is CPU-intensive to decode during token generation (TG).

I've said it in many other places, but in general TG is bottlenecked by RAM bandwidth, unless you're running a KT quant, in which case TG likely becomes bottlenecked by CPU compute.
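To make the two bottlenecks concrete, here is a rough back-of-the-envelope sketch in Python. It is only a toy model, not how any inference engine actually schedules work: the function name, the `decode_bytes_per_sec` parameter, and every number below are hypothetical placeholders, not measurements. It assumes each generated token has to read all active weights once, and that the CPU can unpack quantized weights at some fixed rate that is much lower for a trellis quant.

```python
def estimate_tg_tokens_per_sec(active_bytes_per_token,
                               ram_bandwidth_bytes_per_sec,
                               decode_bytes_per_sec):
    """Toy upper bound on token generation (TG) speed.

    Each token must stream the active weights from RAM and also
    unpack them on the CPU; the slower of the two sets the pace.
    """
    bandwidth_limit = ram_bandwidth_bytes_per_sec / active_bytes_per_token
    compute_limit = decode_bytes_per_sec / active_bytes_per_token
    return min(bandwidth_limit, compute_limit)


# All values below are made-up illustrations, not benchmarks.
RAM_BANDWIDTH = 100e9  # hypothetical 100 GB/s of memory bandwidth

# Larger non-trellis quant: cheap to unpack, so RAM bandwidth is the limit.
larger_quant = estimate_tg_tokens_per_sec(
    active_bytes_per_token=12e9,
    ram_bandwidth_bytes_per_sec=RAM_BANDWIDTH,
    decode_bytes_per_sec=400e9,  # hypothetical fast unpacking rate
)

# Smaller trellis (KT) quant: fewer bytes to read, but expensive to unpack,
# so CPU compute becomes the limit.
trellis_quant = estimate_tg_tokens_per_sec(
    active_bytes_per_token=8e9,
    ram_bandwidth_bytes_per_sec=RAM_BANDWIDTH,
    decode_bytes_per_sec=30e9,  # hypothetical slow trellis unpacking rate
)

print(f"larger quant:  {larger_quant:.2f} tok/s")
print(f"trellis quant: {trellis_quant:.2f} tok/s")
```

With these placeholder numbers the smaller trellis file comes out slower than the larger one, because its limit is the CPU unpacking rate rather than RAM bandwidth. Real speeds depend on your CPU, thread count, and memory configuration.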

The fact that it runs on CPU at all is really amazing, given that other implementations tend to require enough VRAM to run GPU-only.

So the IQ1_KT gives the best quality available for a model that fits into that small amount of RAM, but it takes more CPU compute during TG.
