the faster the inference speed becomes. Why is that?

by lsm03624 - opened 4 days ago

With the same hardware configuration and parameters, the inference speed of the IQ2_KS quantization files is approximately twice that of the IQ1_KT files. Moreover, the larger the files are, the faster the inference speed becomes. Why is that?

ubergarm

Owner 3 days ago

Because IQ1_KT is a "trellis" quant, similar to QTIP/EXL3, and is CPU-intensive to compute during token generation (TG).

I've said it in many other places, but in general TG is bottlenecked by RAM bandwidth. The exception is a KT quant, in which case TG likely becomes bottlenecked by CPU compute instead.

The fact that it runs on CPU at all is really amazing, given that other implementations tend to require enough VRAM to run it GPU-only.

So the IQ1_KT provides the best-quality model available that fits into that small amount of RAM, but it takes more CPU compute during TG.
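To make the bandwidth-versus-compute point concrete, here is a rough back-of-the-envelope sketch. It is not how ik_llama.cpp measures anything. It assumes TG speed is the slower of two limits: how fast RAM can stream the weights read per token, and how fast the CPU can decode those weights. The function name, parameter names, and every number are hypothetical and chosen only for illustration.

```python
def tg_tokens_per_second(weights_read_gib, ram_bandwidth_gib_s, cpu_decode_gib_s):
    """Toy estimate of token generation speed (tokens/s).

    weights_read_gib    -- GiB of quantized weights read per generated token
    ram_bandwidth_gib_s -- sustained RAM bandwidth in GiB/s
    cpu_decode_gib_s    -- rate at which the CPU can decode and use the
                           quantized weights, in GiB/s (hypothetical figure)
    """
    bandwidth_limit = ram_bandwidth_gib_s / weights_read_gib
    compute_limit = cpu_decode_gib_s / weights_read_gib
    # Whichever limit is lower is the bottleneck.
    return min(bandwidth_limit, compute_limit)


# Made-up numbers: a larger, cheap-to-decode quant vs. a smaller trellis quant.
larger_cheap_quant = tg_tokens_per_second(20, 200, 400)   # RAM-bandwidth bound
smaller_trellis_quant = tg_tokens_per_second(12, 200, 60)  # CPU-compute bound

print(larger_cheap_quant, smaller_trellis_quant)
```

With these invented figures the larger file comes out at 10 tokens/s and the smaller trellis file at 5 tokens/s, because the smaller file's speed is set by the CPU decode rate rather than by how many bytes RAM has to deliver.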
