ubergarm committed
Commit 076fc03 · 1 Parent(s): 0b7aeed

Update readme for multiGPU and IQ1_S_R4


DeepSeek-V3-0324-IQ1_S_R4 is now available in the ubergarm/DeepSeek-V3-0324-GGUF Hugging Face repo as well!

Files changed (1): README.md (+12 -12)

README.md CHANGED
@@ -55,9 +55,6 @@ So far these are my best recipes offering the lowest perplexity per GiB models s
  - "Only for the desperate."
  - Technically "better" (lower) PPL than `Qwen3-235B-A22B-Q8_0 @ ~5.31` though you can't really make comparisons like this.
 
- #### TODO
- I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inferencing implementation needs more time to bake.
-
  #### `IQ4_KS_R4` 4.701 BPW (368GiB)
  Special mix `IQ5_KS_R4` `ffn_down` and `IQ4_KS_R4` `ffn_(up|gate)` routed experts. All other layers `q8_0` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack`.
 
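For the CPU-only tip in the hunk above, here is a minimal sketch of adding `--run-time-repack` to a launch. The model path and thread count are placeholders rather than anything from this commit; the other flags are taken from the launch commands elsewhere in this diff.

```bash
# Hypothetical CPU-only launch: --run-time-repack repacks compatible tensors
# (e.g. the q8_0 / iq4_ks layers) at load time for faster CPU inference.
./build/bin/llama-server \
    --model /path/to/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-000NN.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -fmoe \
    --run-time-repack \
    --threads 32
```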
@@ -143,7 +140,6 @@ I'm still testing this out, but initial test am seeing ~12 tok/sec with 256GB RA
  Feel free to report in the comments section your configuration for others to see too. Thanks!
 
  ```bash
- -ts 48,48 \
  --n-gpu-layers 63 \
  -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
  -ot "blk\.(8|9|10|11|12)\.ffn_.*=CUDA1" \
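The hunk above drops `-ts 48,48`: with explicit per-device `-ot` placement there is no tensor-split ratio to tune, tensors are pinned to each CUDA device directly. A minimal sketch of a full two-GPU launch assembled only from flags that appear elsewhere in this diff; the model path and the exact layer ranges are placeholders to adjust for your VRAM.

```bash
# Hypothetical two-GPU launch: the -ot patterns pin those layers' FFN tensors
# to CUDA0/CUDA1, and the final catch-all keeps the remaining routed experts
# on CPU. No -ts tensor-split is used alongside this -ot strategy.
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-server \
    --model /path/to/your-DeepSeek-quant-00001-of-000NN.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
    -ot "blk\.(8|9|10|11|12)\.ffn_.*=CUDA1" \
    --override-tensor exps=CPU
```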
@@ -311,6 +307,9 @@ custom=$(
 
  </details>
 
+ #### `IQ2_KT` Not Yet Released
+ I might release my `iq2_kt` "QTIP/exl3/trellis" style quant, but it is rather experimental and the inferencing implementation needs more time to bake. Quality-wise it is slightly smaller than the above `IQ2_K_R4` with slightly worse perplexity and KLD.
+
  #### `IQ1_S_R4` 130.203 GiB (1.664 BPW)
 
  The world's smallest working DeepSeek-R1-0528 quant!
@@ -318,14 +317,14 @@ The world's smallest working DeepSeek-R1-0528 quant!
  ![KLD Smol Boi Comparison](images/kld-r1-0528-smol-bois.png "Chart showing competitive KLD quality of smallest R1-0528 quants.")
  The Delta P numbers show the average RMS, 99th percentile, and absolute max divergence from the baseline pure `Q8_0`. Lower is better.
 
- If you can fit a larger model completely in RAM+VRAM I would recommend
- that, but if you have 128GB RAM + 24GB VRAM then give this a try as it
- is surprisingly usable despite heavy quantization.
+ If you can fit a larger model completely in RAM+VRAM I would recommend that, but if you have 128GB RAM + 24GB VRAM then give this a try as it is surprisingly usable despite heavy quantization.
 
  Support for this is bleeding edge: you need [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)!
 
  Special mix `IQ1_M_R4` `ffn_down` and `IQ1_S_R4` `ffn_(up|gate)` routed experts. All other layers mostly `iq4_ks` for CPU+GPU offload. For max speed on CPU *only* rigs use `--run-time-repack` (only applies to the `iq4_ks` tensors etc.).
 
+ Also released [ubergarm/DeepSeek-V3-0324-IQ1_S_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ1_S_R4) with the same recipe and size if you don't want thinking.
+
  <details>
 
  <summary>👈 How to run in 128GiB RAM + 24GB VRAM</summary>
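Since the hunk above says this quant needs a bleeding-edge PR, here is a rough sketch of checking it out and building. The CMake flags follow the usual llama.cpp-style CUDA flow and are an assumption, not something taken from this commit; `pr494` is just a local branch label.

```bash
# Hypothetical: fetch PR494 into a local branch and build ik_llama.cpp with CUDA.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git fetch origin pull/494/head:pr494
git checkout pr494
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```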
@@ -338,10 +337,11 @@ Keep in mind if you can fit the next size up it will likely actually run faster
 
  This will fit in ~116.1GiB RAM plus 22448MiB VRAM. You can strip it down more and possibly fit another layer on GPU, or increase context. Good luck!
  ```bash
+ # You can use more CUDA devices; just set them all visible and do *not* use `-ts ...` with this `-ot ...` strategy.
  CUDA_VISIBLE_DEVICES="0" \
  ./build/bin/llama-server \
- --model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S/DeepSeek-R1-0528-IQ1_S-00001-of-00003.gguf \
- --alias ubergarm/DeepSeek-R1-0528-IQ1_S \
+ --model /mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
+ --alias ubergarm/DeepSeek-R1-0528-IQ1_S_R4 \
  --ctx-size 32768 \
  -ctk q8_0 \
  -mla 3 -fa \
@@ -371,12 +371,14 @@ llama_new_context_with_model: CUDA_Host compute buffer size = 78.01 MiB
 
  <details>
 
- ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s quant.")
+ ![Reverse Buff Mokey Meme](images/buff-mokey-meme.png "Reverse Buff Mokey Meme Comparing full R1-671B fp8 to smol iq1_s_r4 quant.")
 
  Possibly useful for 128GiB RAM + 16GB+ VRAM? Maybe? It does actually work and can read python code okay. For all I know it might be better than Qwen3-235B-A22B given the iq1_s_r4 actually has lower PPL!
 
  Not recommended unless this is the *only* thing you can fit completely in RAM+VRAM, as this quant seems less optimized for inferencing and in testing has slower TG and worse quality (higher perplexity) than a larger quant. Plus I'm not sure you can use it with multi-GPU offload, so check the ik_llama.cpp PRs as these tiny quants are less used.
 
+ I recommend *not* using the `IQ1_S`; use the `IQ1_S_R4` now that recent updates add GPU offload support and better speeds for the repacked quant on CUDA.
+
  <summary>👈 Secret Recipe</summary>
 
  ```bash
@@ -488,7 +490,6 @@ CUDA_VISIBLE_DEVICES="0," \
  -amb 512 \
  -fmoe \
  --n-gpu-layers 63 \
- -ts 24,24 \
  -ot "blk\.(3|4)\.ffn_.*=CUDA0" \
  -ot "blk\.(5|6)\.ffn_.*=CUDA1" \
  --override-tensor exps=CPU \
@@ -576,7 +577,6 @@ $ ./build/bin/llama-perplexity \
  -mla 3 -fa \
  -amb 512 \
  -fmoe \
- -ts 48,48 \
  --n-gpu-layers 63 \
  -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
  -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
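For context, the last hunk comes from a `llama-perplexity` invocation. A minimal sketch of what the full command might look like; the model path and the `wiki.test.raw` test file are assumptions, not part of this commit.

```bash
# Hypothetical perplexity run reusing the same multi-GPU -ot offload (no -ts).
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-perplexity \
    --model /path/to/your-DeepSeek-quant-00001-of-000NN.gguf \
    -f wiki.test.raw \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
    --override-tensor exps=CPU
```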
 