ubergarm committed
Commit 7c08a2c
Parent(s): 50eb2a9

Note ik_llama.cpp can run your existing GGUFs

Files changed (1): README.md (+2 -0)
README.md CHANGED
@@ -16,6 +16,8 @@ tags:
 
  This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
 
+ *NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.
+
  These quants provide best in class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.
 
  These quants are specifically designed for CPU+GPU systems with 24-48GB VRAM, and also CPU *only* rigs using dynamic quant repacking (for maximum memory throughput). If you have more VRAM, I suggest a different quant with at least some routed expert layers optimized for GPU offload.
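For readers who want to test-drive `ik_llama.cpp` against one of these quants (or an existing GGUF) on a 24-48GB VRAM CPU+GPU box, below is a minimal launch sketch, not taken from the original README. It assumes your `ik_llama.cpp` build exposes the fork's `-mla`, `-fmoe`, `-rtr`, and `-ot`/`--override-tensor` flags; the model path, `-ngl`, and `--threads` values are placeholders to adjust for your own hardware.

```bash
# Hypothetical example (not from the original README): serve a DeepSeek-style GGUF
# with ik_llama.cpp's llama-server, keeping attention/shared layers on GPU while
# routed MoE expert tensors stay in system RAM.
#   -mla 2 -fa   : MLA attention + flash attention (enables long context in modest VRAM)
#   -fmoe        : fused MoE kernels
#   -rtr         : run-time repacking of CPU-resident tensors (CPU-only / hybrid rigs)
#   -ot exps=CPU : override-tensor rule pinning routed expert weights to system RAM
# The model path, -ngl, and --threads values below are placeholders.
./build/bin/llama-server \
    --model /path/to/your-model.gguf \
    --ctx-size 32768 \
    -mla 2 -fa \
    -fmoe \
    -rtr \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```

The `-ot exps=CPU` rule is what produces the CPU+GPU split described above: everything offloaded by `-ngl` rides on the GPU except the routed expert tensors, which remain in RAM.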