---
license: unknown
---

# Tensor Type Testing

> [!TIP]
> Skip to the bottom of this document for a TL;DR

For more info, see [llama.cpp #12511: Handle user-defined quantization levels for additional tensors](https://github.com/ggml-org/llama.cpp/pull/12511) by @EAddario

Testing done by @ddh0 using [this branch](https://github.com/EAddario/llama.cpp/tree/quantize) as of commit [5a304b8](https://github.com/EAddario/llama.cpp/commit/5a304b8e26b8c53f43e8d12515e52f9bb7d199f0), with libllama built for Linux with CUDA.

## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, in `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf`:
- the model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) is Q4_K
- TYPE_FFN (MLP / feed-forward tensors) is Q4_K
- TYPE_ATTN (K, Q, V, and attention output tensors) is Q8_0
- TYPE_OUTPUT (output tensor) is Q8_0

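For illustration, the four types can be decoded from a filename in this scheme with a small bash sketch (a hypothetical helper, not part of llama.cpp):

```bash
# Decode the four tensor types from a quant filename (illustrative only).
f=Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf

if [[ $f =~ -E([A-Z0-9_]+)-F([A-Z0-9_]+)-A([A-Z0-9_]+)-O([A-Z0-9_]+)\.gguf$ ]]; then
    echo "TYPE_EMBD:   ${BASH_REMATCH[1]}"   # Q4_K
    echo "TYPE_FFN:    ${BASH_REMATCH[2]}"   # Q4_K
    echo "TYPE_ATTN:   ${BASH_REMATCH[3]}"   # Q8_0
    echo "TYPE_OUTPUT: ${BASH_REMATCH[4]}"   # Q8_0
fi
```
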
---

## Command template

```bash
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

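Note the four trailing positional arguments: the source GGUF, the destination GGUF, the base quantization type (reused here as `$TYPE_FFN`) that applies to any tensor not covered by an explicit override, and the thread count. To sweep several types in one run, a small wrapper loop works; this is a sketch under the same hypothetical paths as the template above:

```bash
# Sketch: produce one quant per FFN type, keeping all other tensors at Q8_0.
SRC_GGUF=/my/model/orig.gguf
N_THREADS=4

for TYPE_FFN in Q2_K Q4_K Q6_K; do
    DST_GGUF=/my/model/quant-EQ8_0-F${TYPE_FFN}-AQ8_0-OQ8_0.gguf
    ./llama.cpp/build/bin/llama-quantize --token-embedding-type Q8_0 --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=Q8_0 --tensor-type attn_q=Q8_0 --tensor-type attn_v=Q8_0 --tensor-type attn_out=Q8_0 --output-tensor-type Q8_0 $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
done
```
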
---

## Commands used for Llama 3.2

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0 ⚠️

> **This quant was not included in the testing because Llama 3.2 3B has no separate output tensor (its output weights are tied to the token embeddings), so the resulting file is the same as a normal Q8_0.**

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf -- 0.0027674345765262842
--------------------------------------------------------------------------------
```

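The logs report, for each prompt, the mean-squared deviation between the quantized model's output and the BF16 baseline's output, followed by the average of those ten values. (The evaluation script itself isn't shown, so exactly which values are compared, presumably the output logits, is an assumption here.) Given one model's section of a log in this format, the per-prompt values can be re-averaged with a short sketch; `results.log` is a hypothetical file holding that section:

```bash
# Average the "-- Prompt N: <value>" lines from one model's section of the log.
awk '/^-- Prompt/ { sum += $NF; n++ }
     END { if (n) printf "Average MSD over %d prompts: %.16f\n", n, sum / n }' results.log
```
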
---

## Commands used for Qwen2.5-14B

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf -- 0.0053400141187012196
--------------------------------------------------------------------------------
```

---

## TL;DR

Mean-Squared Deviation from the BF16 baseline, averaged over 10 inputs (lower is better):

|              | Q2_K     | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0       |
| ------------ | -------- | --------------- | -------------- | --------------- | ----------------- | ---------- |
| Llama 3.2 3B | 1.504    | 0.344           | 4.693          | 1.528           | N/A               | 0.002      |
| Qwen2.5-14B  | 1.393    | 0.016           | 2.160          | 0.606           | 0.921             | 0.005      |
| **Average**  | **1.44** | **0.18**        | **3.42**       | **1.06**        | **0.921**         | **0.0035** |

(The TYPE_OUTPUT column is N/A for Llama 3.2 3B because that model has no separate output tensor; see the note above.)

In short, aggressive quantization of the FFN tensors causes the greatest deviation from BF16, while aggressive quantization of the token embeddings causes the least. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
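As a concrete (untested, purely illustrative) application of that advice: since FFN precision appears to matter most and embedding precision least, one might keep the FFN at Q8_0, drop the embeddings to Q4_K, and hold attention and output at Q6_K. Using the command template from above with hypothetical paths:

```bash
# Illustrative mix based on the findings above (not one of the tested quants):
# embeddings lowest, FFN highest, attention and output in between.
TYPE_EMBD=Q4_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q6_K
TYPE_OUTPUT=Q6_K
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/Model-Name-EQ4_K-FQ8_0-AQ6_K-OQ6_K.gguf
N_THREADS=4

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```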