Update README for multi-GPU and IQ1_S_R4
DeepSeek-V3-0324-IQ1_S_R4 is now available in that Hugging Face repo as well!
README.md CHANGED
@@ -55,9 +55,6 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
 - "Only for the desperate."
 - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31` though you can't really make comparisons like this.

-#### TODO
-I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inferencing implementation needs more time to bake.
-
 #### `IQ4_KS_R4` 4.701 BPW (368GiB)
 Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.

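For readers skimming the diff: the `--run-time-repack` note in the `IQ4_KS_R4` line refers to repacking tensors at load time on CPU-only rigs. Below is a minimal sketch of such a launch; the model path is a placeholder and the remaining flags simply reuse ones that already appear elsewhere in this README, so adjust them to your rig.

```bash
# CPU-only sketch (no CUDA offload). --run-time-repack repacks the tensors that are
# not already in an _R4 interleaved layout (here the q8_0 layers) at load time.
# The model path is a placeholder; the split-file names of the real quant differ.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ4_KS_R4.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -fmoe \
    --run-time-repack
```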
@@ -143,7 +140,6 @@ I'm still testing this out, but initial test am seeing ~12 tok/sec with 256GB RA
 Feel free to report in the comments section your configuration for others to see too. Thanks!

 ```bash
--ts 48,48 \
 --n-gpu-layers 63 \
 -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
 -ot "blk\.(8|9|10|11|12)\.ffn_.*=CUDA1" \
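# Note on the override strategy above: the -ot regexes pin whole routed-expert layers
# to a device by matching GGUF tensor names (as they appear in the DeepSeek GGUFs),
# e.g. "blk\.(3|4|5|6|7)\.ffn_.*" matches tensors like blk.3.ffn_up_exps.weight or
# blk.7.ffn_down_exps.weight and places them on CUDA0, while the next pattern sends
# layers 8-12 to CUDA1. The catch-all "--override-tensor exps=CPU" used elsewhere in
# this README keeps the rest of the routed experts in system RAM (check ik_llama.cpp's
# docs for the exact override precedence). With placement made explicit like this the
# proportional "-ts 48,48" split is redundant, which is why it is removed here, and
# the comment added later in this commit says not to combine -ts with this -ot strategy.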
@@ -311,6 +307,9 @@ custom=$(

 </details>

+#### `IQ2_KT` Not Yet Released
+I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inferencing implementation needs more time to bake. Quality wise it is slightly smaller than the above `IQ2_K_R4` with slightly worse perplexity and KLD.
+
 #### `IQ1_S_R4` 130.203 GiB (1.664 BPW)

 The world's smallest working DeepSeek-R1-0528 quant!
@@ -318,14 +317,14 @@ The world's smallest working DeepSeek-R1-0528 quant!
 
 The Delta P numbers show the average RMS, 99th percentile, and absolute max divergence from the baseline pure `Q8_0`. Lower is better.

-If you can fit a larger model completely in RAM+VRAM I would recommend
-that, but if you have 128GB RAM + 24GB VRAM then give this a try as it
-is surprisingly usable despite heavy quantization.
+If you can fit a larger model completely in RAM+VRAM I would recommend that, but if you have 128GB RAM + 24GB VRAM then give this a try as it is surprisingly usable despite heavy quantization.

 Support for this is bleeding edge; you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)!

 Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack` (only applies to the `iq4_ks` tensors etc.).

+Also released [ubergarm/DeepSeek-V3-0324-IQ1_S_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ1_S_R4) with the same recipe and size if you don't want thinking.
+
 <details>

 <summary>👈 How to run in 128GiB RAM + 24GB VRAM</summary>
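Since this quant needs PR494 of ik_llama.cpp, here is a hedged sketch of one way to try a PR before it is merged. The cmake invocation is a generic assumption, so check ik_llama.cpp's own build instructions for the correct CUDA option on your system; and if PR494 has already been merged, building the main branch is enough.

```bash
# Fetch the PR head using GitHub's pull refspec and build it.
# The cmake flags below are an assumption; follow ik_llama.cpp's README for the
# exact CUDA toggle and any recommended build options.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git fetch origin pull/494/head:pr494
git checkout pr494
cmake -B build
cmake --build build --config Release -j
```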
@@ -338,10 +337,11 @@ Keep in mind if you can fit the next size up it will likely actually run faster

 This will fit in ~116.1GiB RAM plus 22448MiB VRAM. You can strip it down more and possibly squeeze another layer onto GPU, or increase context. Good luck!
 ```bash
+# You can use more CUDA devices; just set them all visible and do *not* use `-ts ...` with this `-ot ...` strategy.
 CUDA_VISIBLE_DEVICES="0" \
 ./build/bin/llama-server \
---model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/
---alias ubergarm/DeepSeek-R1-0528-
+--model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
+--alias ubergarm/DeepSeek-R1-0528-IQ1_S_R4 \
 --ctx-size 32768 \
 -ctk q8_0 \
 -mla 3 -fa \
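# Per the comment added at the top of this block: to use a second GPU, expose both
# devices and re-split the -ot layer overrides instead of adding -ts. A hypothetical
# (untuned) two-GPU variant of the launch above might look like:
#   CUDA_VISIBLE_DEVICES="0,1" \
#   ./build/bin/llama-server \
#     ... same model/alias/context/attention flags as above ...
#     --n-gpu-layers 63 \
#     -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
#     -ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
#     --override-tensor exps=CPU
# The layer ranges depend on each card's VRAM, so treat them as placeholders.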
@@ -371,12 +371,14 @@ llama_new_context_with_model: CUDA_Host compute buffer size = 78.01 MiB

 <details>

-. Plus I'm not sure that you can use it with multi-GPU offload so check the ik_llama.cpp PRs as these tiny quants are less used.
+

 Possibly useful for 128GiB RAM + 16GB+ VRAM? Maybe? It does actually work and can read python code okay. For all I know it might be better than Qwen3-235B-A22B given the iq1_s_r4 actually has lower PPL!

 Not recommended and slower than a larger quant unless this is the *only* thing you can fit completely in RAM+VRAM, as this quant seems slower and less optimized for inferencing and in testing has slower TG and worse quality (higher perplexity). Plus I'm not sure that you can use it with multi-GPU offload, so check the ik_llama.cpp PRs as these tiny quants are less used.

+I recommend *not* using the `IQ1_S`; use the `IQ1_S_R4` now that recent updates support GPU offload and give better speeds with the repacked quant on CUDA.
+
 <summary>👈 Secret Recipe</summary>

 ```bash
@@ -488,7 +490,6 @@ CUDA_VISIBLE_DEVICES="0," \
 -amb 512 \
 -fmoe \
 --n-gpu-layers 63 \
--ts 24,24 \
 -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
 -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
 --override-tensor exps=CPU \
@@ -576,7 +577,6 @@ $ ./build/bin/llama-perplexity \
 -mla 3 -fa \
 -amb 512 \
 -fmoe \
--ts 48,48 \
 --n-gpu-layers 63 \
 -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
 -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
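# For reference, these flags sit inside a full llama-perplexity run along the lines
# of the sketch below; the model path is a placeholder and -f points at the usual
# wiki.test.raw file used for llama.cpp-style perplexity measurements:
#   ./build/bin/llama-perplexity \
#     --model /path/to/DeepSeek-R1-0528-quant.gguf \
#     -f wiki.test.raw \
#     -mla 3 -fa -amb 512 -fmoe \
#     --n-gpu-layers 63 \
#     -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
#     -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
#     --override-tensor exps=CPU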