ubergarm committed on
Commit 88495ef · 1 Parent(s): f740b23

World's smallest working R1-0528 Quant!


With recent PRs from ik_llama.cpp, you can now run the full DeepSeek-R1-0528 671B model in 128GB RAM + 24GB VRAM! What a time to be alive!
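
Running a 671B MoE model in that footprint relies on keeping the routed expert tensors in system RAM and putting only the remaining layers plus KV cache on the single GPU. A minimal launch sketch, assuming an ik_llama.cpp build with CUDA support; the model path, thread count, layer count, and the `-mla`/`-amb` values are placeholders to tune for your hardware, not tested settings:

```bash
# Sketch: keep the routed expert tensors on CPU/RAM (`exps=CPU`) and offload the
# remaining layers plus KV cache to a single 24GB GPU. Paths and numeric values
# below are illustrative placeholders.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```

On a CPU-only rig the same idea applies without the GPU offload flags, plus the `--run-time-repack` option mentioned in the README below.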

Files changed (2)
  1. README.md +21 -8
  2. images/kld-r1-0528-smol-bois.png +3 -0
README.md CHANGED
@@ -45,14 +45,18 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
   - `Final estimate: PPL = 3.5069 +/- 0.01893`
   - Fits 32k context in under 16GiB VRAM
   - Fits 64k context in under 24GiB VRAM
-* `DeepSeek-R1-0528-IQ1_S` 134GiB
-  - `Final estimate: PPL = 4.8831 +/- 0.02878`
+* `DeepSeek-R1-0528-IQ1_S_R4` 131GiB
+  - `Final estimate: PPL = 4.8805 +/- 0.02876`
+  - The world's smallest working DeepSeek-R1-0528 quant!
+  - Runs on an AM5-class gaming rig with a 2x64GB DDR5 DIMM kit and a single GPU!
+  - Support for this is bleeding edge; you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)
   - Fits 32k+ context in under 16GiB VRAM
   - Should fit in 128GiB RAM + 24GB VRAM by offloading layers to GPU.
-  - *Don't use the old `IQ1_S_R4` if you need to offload to GPU!*
   - "Only for the desperate."
   - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31` though you can't really make comparisons like this.
-* I might try an `iqN_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and takes a long time to cook and slow on CPU inference...
+
+#### TODO
+I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inference implementation needs more time to bake.
 
 #### `IQ4_KS_R4` 4.701 BPW (368GiB)
 Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.

@@ -307,9 +311,20 @@ custom=$(
 
 </details>
 
-#### `IQ1_S` 133.063 GiB (1.701 BPW)
+#### `IQ1_S_R4` 130.203 GiB (1.664 BPW)
+
+The world's smallest working DeepSeek-R1-0528 quant!
+
+![KLD Smol Boi Comparison](images/kld-r1-0528-smol-bois.png "Chart showing competitive KLD quality of smallest R1-0528 quants.")
+The Delta P numbers show the average RMS, 99th percentile, and absolute maximum divergence from the baseline pure `Q8_0`. Lower is better.
 
-Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
+If you can fit a larger model completely in RAM+VRAM I would recommend
+that, but if you have 128GB RAM + 24GB VRAM then give this a try, as it
+is surprisingly usable despite the heavy quantization.
+
+Support for this is bleeding edge; you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)!
+
+Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack` (this only applies to the `iq4_ks` tensors etc.).
 
 <details>
 

@@ -354,8 +369,6 @@ llama_new_context_with_model: CUDA_Host compute buffer size = 78.01 MiB
 
 </details>
 
-*NOTE*: Probably don't use the similar sized repacked version `IQ1_S_R4` 1.664 BPW (131GiB) as it can't run on GPU so only if you are doing CPU only or know what you're doing specifically e.g. having over 128GB RAM.
-
 <details>
 
 ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s quant.")
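
The "special mix" recipes referenced in the diff above (e.g. `IQ1_M_R4` `ffn_down` with `IQ1_S_R4` `ffn_(up|gate)` experts and mostly `iq4_ks` elsewhere) are produced with per-tensor overrides at quantization time. A rough sketch of what such a recipe can look like with ik_llama.cpp's `llama-quantize`, assuming its `--custom-q` regex=type override mechanism; the regexes, types, and paths below are illustrative guesses, not the actual recipe used for this model:

```bash
# Hypothetical per-tensor recipe in the spirit of the IQ1_S_R4 mix described above.
custom="
# Routed experts: ffn_down slightly larger than ffn_(up|gate)
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_s_r4
# Remaining block tensors mostly iq4_ks so they stay GPU-friendly
blk\..*=iq4_ks
"
# Collapse the multi-line recipe into the comma-separated form --custom-q expects.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/DeepSeek-R1-0528-BF16.gguf \
    /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    IQ1_S_R4 24
```

The first matching regex wins for each tensor, which is why the expert overrides come before the catch-all `blk\..*` line in this sketch.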
images/kld-r1-0528-smol-bois.png ADDED

Git LFS Details

  • SHA256: 37287f8bb732ecf623a63a3d0cc67c171349cf73cfb5317f4cfdb0f16f64bac2
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
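
The added chart reports KLD and Delta P statistics for the smallest R1-0528 quants against a pure `Q8_0` baseline. As a rough sketch of how such statistics are typically gathered with the llama.cpp-family `llama-perplexity` tool (an assumption about tooling, not necessarily the exact workflow behind this chart; the paths and test corpus are placeholders):

```bash
# 1) Run the Q8_0 baseline once over a test corpus and save its token logits.
./build/bin/llama-perplexity \
    -m /path/to/DeepSeek-R1-0528-Q8_0.gguf \
    -f /path/to/test-corpus.txt \
    --kl-divergence-base /tmp/r1-0528-q8_0-logits.dat

# 2) Replay the saved baseline with the small quant; this prints KL divergence
#    plus Delta P statistics (mean, RMS, 99th percentile, maximum).
./build/bin/llama-perplexity \
    -m /path/to/DeepSeek-R1-0528-IQ1_S_R4.gguf \
    --kl-divergence-base /tmp/r1-0528-q8_0-logits.dat \
    --kl-divergence
```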