Release DeepSeek-V3-0324-IQ4_K_R4 and benchmarks

- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00002-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00003-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00004-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00005-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00006-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00007-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00008-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00009-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00010-of-00010.gguf +3 -0
- README.md +173 -71
- benchmarks-01.png +3 -0
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eb8ef08b44a99223040cb02d2f89764eb03662669a65c690da670a3770521f57
+size 41169676352

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00002-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7ddbd081cbdad380bb4548c81a2fc43a7f405d306f29678dfa1283b998c0ff3f
+size 42494252256

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00003-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74789329b86ab85418f361e0e167c627ff94b0c12d27a1acd75823120c6b82e4
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00004-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c2840f878709701a655caca5ee86952293cf00137677065582eed49595491a4
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00005-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a07cb7b0c4d8693fce701d08e9ec4cb2e693273279ba39fd17c3a1755439e81c
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00006-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:18483856dcc014e7aa32c55b641695ff05095822b86c05c87d901f9d1b3dfee2
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00007-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c925b58c8394d1e965c930e2f6c415b0ea28cefb4bf6c383575f5e27d60c89a
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00008-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:520fecd53d32111018cd13c235d5731c737865497560726c4d253804476516ae
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00009-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8eba7dada84aad746661978ef4edcd6cf6b12d5a2cb27840d52d49dfeb89d882
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00010-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1af85ec57870ca34ea61152fd6ee2697bd8a3265006c8965ce80b12904ab1b46
+size 33542014112
README.md
CHANGED

@@ -4,15 +4,174 @@ pipeline_tag: text-generation
 base_model: deepseek-ai/DeepSeek-V3-0324
 license: mit
 base_model_relation: quantized
 ---
 
-## `ik_llma.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324
-
-This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download
-
-These quants provide
-
-
 
 <details>
 
@@ -59,33 +218,16 @@ Final estimate: PPL = 3.4755 +/- 0.03305
 
 </details>
 
-
-
-#### `IQ2_K_R4`
-Hybrid `IQ2_K_R4` non-linear quant for 32k context using `q8_0` MLA in for CPU+GPU offload with 96+GB RAM and 24+GB VRAM with minimal perplexity.
-
-<details>
-
-<summary>`IQ2_K_R4` Details Here</summary>
-
-```bash
-$ git branch
-* ik/make_qx_quants
-
-$ git rev-parse --short HEAD
-b9c25fe7
-```
-
----
-
-## Quantize Script
+#### Quant Cookers Secret Recipe
 
 ```bash
 #!/usr/bin/env bash
 
 custom="
-# Token embedding
+# Token embedding (GPU)
+# NOTE: cannot be a repacked type due to tensor size
 token_embd\.weight=q8_0
+# output tensors (GPU)
 output\.weight=q8_0
 output_norm\.weight=q8_0
 
@@ -93,6 +235,7 @@ output_norm\.weight=q8_0
 blk\.[0-2]\..*=q8_0
 
 # All attention, weights, and bias tensors for MoE layers (3-60) (GPU)
+# NOTE: attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128
 blk\.[3-9]\.attn_.*=q8_0
 blk\.[1-5][0-9]\.attn_.*=q8_0
 blk\.60\.attn_.*=q8_0
@@ -114,7 +257,8 @@ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
 blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
 blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
 
-#
+# Routed Experts (3-60) (CPU)
+# NOTE: Traditional wisdom suggests earlier layers use higher quants
 blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
 blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
 blk\.60\.ffn_down_exps\.weight=iq3_k_r4
@@ -140,9 +284,7 @@ custom=$(
 24
 ```
 
-
-
-## Perplexity
+#### Perplexity
 
 ```bash
 $ CUDA_VISIBLE_DEVICES="0," \
@@ -559,12 +701,10 @@ llama_print_timings: total time = 2841519.57 ms / 287233 tokens
 Final estimate: PPL = 3.5614 +/- 0.02001
 ```
 
-
-
-## Split
+#### Split
 
 ```bash
-$ ./build/bin/llama-gguf-split
+$ ./build/bin/llama-gguf-split \
 --dry-run \
 --split \
 --split-max-size 50G \
@@ -574,44 +714,6 @@ $ ./build/bin/llama-gguf-split
 
 </details>
 
-#### `TODO`
-
-- [ ] Upload good CPU *only* optimized inferencing quant
-
-## `ik_llama.cpp` API server
-
-```bash
-# I think temperature "1.0" on the API is 0.3 in llama.cpp ????
-# https://api-docs.deepseek.com/quick_start/parameter_settings
-# https://www.reddit.com/r/LocalLLaMA/comments/1jk0qjs/178bit_deepseekv30324_230gb_unsloth_dynamic_gguf/
-
-# Uses just under 24GB VRAM
-CUDA_VISIBLE_DEVICES="0," \
-./build/bin/llama-server \
---model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
---alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
---ctx-size 32768 \
--ctk q8_0 \
--mla 2 -fa \
--amb 512 \
--fmoe \
---min-p 0.01 \
---temp 0.0 \
---n-gpu-layers 63 \
---override-tensor exps=CPU \
---parallel 1 \
---threads 16 \
---host 127.0.0.1 \
---port 8080
-```
-
-## Big Thanks
-Big thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for sharing tips and tricks to help each other access all the fun new models!
-
-Shout out to the **Level1Techs** crew, community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs), and for providing big hardware expertise and access to run these experiments!!!
-
-Finally, I'm still learning the ropes, so please be patient and we can learn together. Thanks!
-
 ## References
 * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
 * [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
base_model: deepseek-ai/DeepSeek-V3-0324
license: mit
base_model_relation: quantized
tags:
- mla
- imatrix
- deepseek_v3
- conversational
---

## `ik_llama.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

These quants provide best-in-class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.

Perfect for CPU+GPU systems with 24GB+ VRAM, and also for CPU *only* rigs using dynamic quant repacking (for maximum memory throughput).

You can try `ik_llama.cpp` quickly with your *existing* quants, as it computes MLA tensors and repacks quants on the fly at startup (if you have enough RAM+VRAM to fit the entire model). Then come check out these fat quants here once you see the difference.
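
If you are new to the fork, a minimal CUDA build looks roughly like the sketch below. This is a hedged sketch, not the project's official instructions: the CMake flag names and build layout are assumptions that can drift between versions, so double-check the Getting Started Guide linked in the References.

```bash
# Minimal sketch of building ik_llama.cpp with CUDA support.
# Assumes git, CMake, a C++ toolchain, and the CUDA toolkit are installed;
# flag names may differ by version, so consult the repo docs if this fails.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"

# Quick sanity check that the binaries were built
./build/bin/llama-server --help | head
```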

## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!

## Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB, suitable for a wide variety of CPU+GPU or CPU *only* rigs.

#### `IQ4_K_R4` 4.936 BPW
Special mix of `IQ5_K_R4`/`IQ4_K_R4` routed experts with all other layers at full `q8_0`, for CPU+GPU offload or `--run-time-repack` for max speed on CPU *only* rigs.
Great for a big 384+ GB RAM rig with a 24GB+ GPU.

#### `IQ2_K_R4` 2.889 BPW
Special mix of `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers at full `q8_0`, for CPU+GPU offload or `--run-time-repack` for max speed on CPU *only* rigs.
Great for a CPU+GPU "troll rig" high-end gamer system, e.g. 9950X with 96 GB RAM + 3090TI with 24 GB VRAM + Gen 5 NVMe SSD.
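
As a rough sanity check on the 4.936 BPW figure, you can divide the total size of the ten `IQ4_K_R4` split files above (about 414.7 GB per the LFS pointers) by DeepSeek-V3's roughly 671B parameters. The small gap versus the stated BPW comes from GGUF metadata and the approximate parameter count, so treat this as an estimate only.

```bash
# Rough bits-per-weight estimate from the published split sizes (sum of the LFS
# pointers above) and an approximate 671B total parameter count for DeepSeek-V3.
awk 'BEGIN { bytes = 414665708736; params = 671e9; printf "%.3f bits/weight\n", bytes * 8 / params }'
# prints roughly 4.94, in line with the stated 4.936 BPW
```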

#### Custom Mixes
If you have multiple GPUs and more VRAM, you can make custom quants to optimize size and quality for whatever hardware you have. If you have less VRAM, you could make a custom quant that is leaner in the non-routed-expert layers, or go for 64k+ context in 24GB VRAM. You can also use the offline repack tool if you want to run CPU *only* with `mmap()` still enabled.
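
A custom mix is just a different set of regex-to-type rules fed to the fork's quantize tool, following the same pattern as the recipe shown in the hunks above. The sketch below is hedged: the `--custom-q` flag, the rule-joining step, the imatrix filename, and all paths are assumptions, so verify against `llama-quantize --help` in your own build before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical custom mix: keep attention, dense, and shared-expert layers at q8_0
# and shrink only the routed experts. Flag names and paths are assumptions.
custom="
token_embd\.weight=q8_0
output\.weight=q8_0
blk\..*\.attn_.*=q8_0
blk\..*\.ffn_.*_shexp\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_k_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k_r4
"

# Drop comments/blank lines and join the rules with commas
# (assumed to be the format the custom-rules option expects).
rules=$(echo "$custom" | grep -vE '^[[:space:]]*(#|$)' | paste -sd,)

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$rules" \
    /path/to/DeepSeek-V3-0324-source.gguf \
    /path/to/DeepSeek-V3-0324-CUSTOM.gguf \
    IQ4_K_R4 \
    24
```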

## Quick Start
#### `ik_llama.cpp` API server for GPU+CPU
```bash
# Fits 32k context in under 24GB VRAM
# Optional `-ser 6,1` improves speed at minimal cost to quality
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
```
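
Once the server is up, a quick smoke test from another terminal looks like the sketch below; it assumes the fork keeps mainline llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint, so adjust if your build differs.

```bash
# Minimal smoke test against the server started above. The endpoint and JSON shape
# assume llama.cpp's standard OpenAI-compatible HTTP API.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "temperature": 0.3,
        "max_tokens": 64
      }'
```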

#### `ik_llama.cpp` API server for CPU *only*
```bash
# The goal for now is as much RAM bandwidth in a single NUMA node as possible, e.g.
# use BIOS `NPS0` on AMD Epyc, or a single socket of Intel Xeon with BIOS `SNC=Disable`.
# Tune `--threads` for token generation and `--threads-batch` for prompt processing (prefill).
# Note `--run-time-repack` will pre-allocate enough RAM for the model weights instead of mmap()'ing off disk.
# Note there are options for both Explicit and Transparent Huge Pages, with tuning discussion in the git repo:
# https://github.com/ikawrakow/ik_llama.cpp/pull/278#issuecomment-2746381515
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
    --run-time-repack \
    --ctx-size 65536 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --parallel 1 \
    --threads 88 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080
```
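
Before a long CPU *only* run it is worth confirming that NUMA node 0 really has enough memory for the repacked weights and checking your huge page policy (see the PR discussion linked in the comments above). A couple of standard Linux checks:

```bash
# Confirm the NUMA layout and that node 0 has enough memory for the full set of weights
numactl --hardware

# Inspect the current Transparent Huge Page policy; trade-offs are discussed in the
# ik_llama.cpp PR #278 thread referenced above
cat /sys/kernel/mm/transparent_hugepage/enabled
```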

## Quant Comparisons

These are probably the **best quants available in this size class** for `V3-0324`!

![Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`](benchmarks-01.png "Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`")

ubergarm made no sacrifices for token embedding, attention, dense layers, or shared experts. This is possible because the `ik_llama.cpp` MLA implementation saves so much GPU VRAM, enabling 32k context in under 24GB VRAM. These quants also use a new high quality imatrix including various coding samples and multiple written languages. Routed expert layers use SotA CPU `IQx_K_R4` non-linear quants as well, for likely the best perplexity per GiB.

bartowski uses full token embedding quality but lower attention, dense layer, and shared expert quants. He does use a good quality imatrix, with perplexity performance within the measurement error relative to this one.

unsloth sacrifices token embedding, uses middle quality attention and dense layers, and no importance matrix.

mradermacher's model card side-bar is not showing, so I haven't yet fully compared the exact recipe. Working with them to get info on their split GGUFs.
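
To reproduce the wiki.test.raw perplexity number in the table below, something like the following sketch should work. It simply mirrors the flags from the GPU+CPU server command above and assumes you already have the standard wikitext-2 `wiki.test.raw` file on disk, so treat it as a starting point rather than the exact command used here.

```bash
# Hedged sketch of a perplexity run over wikitext-2 (wiki.test.raw assumed present);
# flags mirror the GPU+CPU server example above and may need tuning for your hardware.
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
    -f wiki.test.raw \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16
```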

#### Comparison Details

<details>

<summary>Detailed Comparison of ~Q2 Class Quants</summary>

| | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
| --- | --- | --- | --- | --- |
| **Overview** | | | | |
| `split.tensors.count` | 1147 | 1025 | 1025 | |
| `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | |
| File Size (GiB) | 227 | 228 | 231 | |
| **Multi-Head Latent Attention** | | | | |
| `blk.*.attn_kv_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_k_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_v_b.weight` | `Q8_0` | n/a | n/a | n/a |
| **Dense Layers** | | | | |
| `blk.[0-2].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_down.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[0-2].ffn_gate.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].ffn_up.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Shared & Routed MoE Layers** | | | | |
| `blk.[3-60].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].exp_probs_b.bias` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_down_exps.weight` | `IQ3_K_R4` | `Q3_K` | `Q3_K` | |
| `blk.[3-60].ffn_down_shexp.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[3-60].ffn_gate_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_gate_inp.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_gate_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_up_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Importance Matrix & Perplexity** | | | | |
| `imatrix.dataset` | `calibration_data_v5_rc.txt` | `calibration_datav3.txt` | n/a | ? |
| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | ? | ? | ? |

</details>

#### imatrix
benchmarks-01.png
ADDED