a few questions
I started making versions that fit my PC, and you probably know a lot more about all this by now, so I just have a few questions:
- I have been using bf16 GGUF for quanting, but I suppose bf16 safetensors would be the same?
- The imatrix isn't interchangeable between regular llama.cpp and ik_llama.cpp? And does making one still need ~800 GB of RAM, or would page swapping work? I am making quants on my 7800X3D with 192 GB, and I am pretty sure it page swaps and takes 16 hours (you can use that for your guides, since you only had your Threadripper estimate).
> I have been using bf16 GGUF for quanting, but I suppose bf16 safetensors would be the same?
You will need a bf16 GGUF to do any quantizing with ik_llama.cpp or mainline llama.cpp. I describe a few methods to get either the og MLA style with attn_kv_b or the newer mainline style without attn_kv_b here: https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/discussions/1#68a9cbd78d7a0473ba5b3c8e
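Roughly, the flow looks like this with mainline llama.cpp (just a sketch; all paths are placeholders and the exact invocation can vary by checkout and model):

```bash
# Convert the bf16 safetensors checkpoint to a bf16 GGUF.
# convert_hf_to_gguf.py lives in the llama.cpp repo root.
python convert_hf_to_gguf.py /models/DeepSeek-V3.1 \
    --outtype bf16 \
    --outfile /models/DeepSeek-V3.1-bf16.gguf

# Then quantize from that bf16 GGUF, optionally with an imatrix.
./build/bin/llama-quantize \
    --imatrix /models/imatrix.dat \
    /models/DeepSeek-V3.1-bf16.gguf \
    /models/DeepSeek-V3.1-IQ4_XS.gguf \
    IQ4_XS
```

So a bf16 safetensors checkpoint works fine as the starting point; it just gets converted to a bf16 GGUF first.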
> The imatrix isn't interchangeable between regular llama.cpp and ik_llama.cpp?
They are kind of interchangeable for non-MLA quants. But for MLA quants you will need the matching kind, with or without attn_kv_b. I have provided both styles of imatrix in this repo. The current one is for the mainline style without attn_kv_b. If you want the old one (the first one I uploaded here), follow the link above and you'll find it linked there.
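For reference, making the imatrix is just one pass of some calibration text through the bf16 model. A minimal sketch with mainline's llama-imatrix (file names are placeholders; ik_llama.cpp has its own equivalent tool, and the style of imatrix you get follows the style of the bf16 GGUF you run it on, with or without attn_kv_b):

```bash
# Run calibration text through the bf16 model and collect
# per-tensor activation statistics into imatrix.dat.
./build/bin/llama-imatrix \
    -m /models/DeepSeek-V3.1-bf16.gguf \
    -f calibration_data.txt \
    -o imatrix.dat \
    --ctx-size 512
```

Since the model is mmap'd by default, the OS can page weights in and out during this pass, which should line up with your observation that it runs, slowly, on 192 GB.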
Cheers and good luck!