ubergarm committed on
Commit 88495ef · 1 Parent(s): f740b23

World's smallest working R1-0528 Quant!


With recent PRs from ik_llama.cpp, you can now run the full DeepSeek-R1-0528 671B model in 128GB RAM + 24GB VRAM! What a time to be alive!
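
Running a 671B MoE model in that footprint relies on keeping the routed expert tensors in system RAM and putting only the remaining layers plus KV cache on the single GPU. A minimal launch sketch, assuming an ik_llama.cpp build with CUDA support; the model path, thread count, layer count, and the `-mla`/`-amb` values are placeholders to tune for your hardware, not tested settings:

```bash
# Sketch: keep the routed expert tensors on CPU/RAM (`exps=CPU`) and offload the
# remaining layers plus KV cache to a single 24GB GPU. Paths and numeric values
# below are illustrative placeholders.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```

On a CPU-only rig the same idea applies without the GPU offload flags, plus the `--run-time-repack` option mentioned in the README below.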

Files changed (2)
  1. README.md +21 -8
  2. images/kld-r1-0528-smol-bois.png +3 -0
README.md CHANGED
@@ -45,14 +45,18 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
   - `Final estimate: PPL = 3.5069 +/- 0.01893`
   - Fits 32k context in under 16GiB VRAM
   - Fits 64k context in under 24GiB VRAM
-* `DeepSeek-R1-0528-IQ1_S` 134GiB
-  - `Final estimate: PPL = 4.8831 +/- 0.02878`
+* `DeepSeek-R1-0528-IQ1_S_R4` 131GiB
+  - `Final estimate: PPL = 4.8805 +/- 0.02876`
+  - The world's smallest working DeepSeek-R1-0528 quant!
+  - Runs on an AM5-class gaming rig with a 2x64GB DDR5 DIMM kit and a single GPU!
+  - Support for this is bleeding edge; you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)
   - Fits 32k+ context in under 16GiB VRAM
   - Should fit in 128GiB RAM + 24GB VRAM by offloading layers to GPU.
-  - *Don't use the old `IQ1_S_R4` if you need to offload to GPU!*
   - "Only for the desperate."
   - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31` though you can't really make comparisons like this.
-* I might try an `iqN_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and takes a long time to cook and slow on CPU inference...
+
+#### TODO
+I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inference implementation needs more time to bake.
 
 #### `IQ4_KS_R4` 4.701 BPW (368GiB)
 Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.

@@ -307,9 +311,20 @@ custom=$(
 
 </details>
 
-#### `IQ1_S` 133.063 GiB (1.701 BPW)
+#### `IQ1_S_R4` 130.203 GiB (1.664 BPW)
+
+The world's smallest working DeepSeek-R1-0528 quant!
+
+![KLD Smol Boi Comparison](images/kld-r1-0528-smol-bois.png "Chart showing competitive KLD quality of smallest R1-0528 quants.")
+The Delta P numbers show the average RMS, 99th percentile, and absolute maximum divergence from the baseline pure `Q8_0`. Lower is better.
 
-Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
+If you can fit a larger model completely in RAM+VRAM I would recommend
+that, but if you have 128GB RAM + 24GB VRAM then give this a try, as it
+is surprisingly usable despite the heavy quantization.
+
+Support for this is bleeding edge; you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)!
+
+Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack` (this only applies to the `iq4_ks` tensors etc.).
 
 <details>
 

@@ -354,8 +369,6 @@ llama_new_context_with_model: CUDA_Host compute buffer size = 78.01 MiB
 
 </details>
 
-*NOTE*: Probably don't use the similar sized repacked version `IQ1_S_R4` 1.664 BPW (131GiB) as it can't run on GPU so only if you are doing CPU only or know what you're doing specifically e.g. having over 128GB RAM.
-
 <details>
 
 ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s quant.")
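
The "special mix" recipes referenced in the diff above (e.g. `IQ1_M_R4` `ffn_down` with `IQ1_S_R4` `ffn_(up|gate)` experts and mostly `iq4_ks` elsewhere) are produced with per-tensor overrides at quantization time. A rough sketch of what such a recipe can look like with ik_llama.cpp's `llama-quantize`, assuming its `--custom-q` regex=type override mechanism; the regexes, types, and paths below are illustrative guesses, not the actual recipe used for this model:

```bash
# Hypothetical per-tensor recipe in the spirit of the IQ1_S_R4 mix described above.
custom="
# Routed experts: ffn_down slightly larger than ffn_(up|gate)
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_s_r4
# Remaining block tensors mostly iq4_ks so they stay GPU-friendly
blk\..*=iq4_ks
"
# Collapse the multi-line recipe into the comma-separated form --custom-q expects.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/DeepSeek-R1-0528-BF16.gguf \
    /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    IQ1_S_R4 24
```

The first matching regex wins for each tensor, which is why the expert overrides come before the catch-all `blk\..*` line in this sketch.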
images/kld-r1-0528-smol-bois.png ADDED

Git LFS Details

  • SHA256: 37287f8bb732ecf623a63a3d0cc67c171349cf73cfb5317f4cfdb0f16f64bac2
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
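
The added chart reports KLD and Delta P statistics for the smallest R1-0528 quants against a pure `Q8_0` baseline. As a rough sketch of how such statistics are typically gathered with the llama.cpp-family `llama-perplexity` tool (an assumption about tooling, not necessarily the exact workflow behind this chart; the paths and test corpus are placeholders):

```bash
# 1) Run the Q8_0 baseline once over a test corpus and save its token logits.
./build/bin/llama-perplexity \
    -m /path/to/DeepSeek-R1-0528-Q8_0.gguf \
    -f /path/to/test-corpus.txt \
    --kl-divergence-base /tmp/r1-0528-q8_0-logits.dat

# 2) Replay the saved baseline with the small quant; this prints KL divergence
#    plus Delta P statistics (mean, RMS, 99th percentile, maximum).
./build/bin/llama-perplexity \
    -m /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    --kl-divergence-base /tmp/r1-0528-q8_0-logits.dat \
    --kl-divergence
```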