---
license: unknown
---

# Tensor Type Testing

> [!TIP]
> Skip to the bottom of this document for a TL;DR.

For more info, see [llama.cpp #12511: Handle user-defined quantization levels for additional tensors](https://github.com/ggml-org/llama.cpp/pull/12511) by @EAddario.

Testing was done by @ddh0 using [this branch](https://github.com/EAddario/llama.cpp/tree/quantize) as of commit [5a304b8](https://github.com/EAddario/llama.cpp/commit/5a304b8e26b8c53f43e8d12515e52f9bb7d199f0), with libllama built for Linux with CUDA.

## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf`:
- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) is Q4_K
- TYPE_FFN (MLP / feed-forward tensors) is Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) is Q8_0
- TYPE_OUTPUT (output tensor) is Q8_0
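
Since the scheme is purely positional, the four per-tensor types can be recovered from a filename mechanically. Below is a minimal sketch (not part of the original testing setup) that parses the suffix with a regular expression; the function name is illustrative:

```python
import re

# Matches the E{...}-F{...}-A{...}-O{...} suffix described above.
QUANT_RE = re.compile(
    r"-E(?P<embd>[A-Z0-9_]+)-F(?P<ffn>[A-Z0-9_]+)"
    r"-A(?P<attn>[A-Z0-9_]+)-O(?P<output>[A-Z0-9_]+)\.gguf$"
)

def parse_quant_name(filename: str) -> dict:
    """Return the per-tensor GGML types encoded in a quant filename."""
    m = QUANT_RE.search(filename)
    if m is None:
        raise ValueError(f"not a recognized quant name: {filename}")
    return m.groupdict()

print(parse_quant_name("Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf"))
# -> {'embd': 'Q4_K', 'ffn': 'Q4_K', 'attn': 'Q8_0', 'output': 'Q8_0'}
```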

---

## Command template

```bash
TYPE_EMBD=GGML_TYPE    # token embeddings
TYPE_FFN=GGML_TYPE     # ffn_down, ffn_gate, ffn_up
TYPE_ATTN=GGML_TYPE    # attn_k, attn_q, attn_v, attn_out
TYPE_OUTPUT=GGML_TYPE  # output tensor
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

# The positional type argument (here $TYPE_FFN) sets the default quantization
# type; the --tensor-type, --token-embedding-type, and --output-tensor-type
# flags then override it for the named tensors.
./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
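
The per-tensor overrides in the resulting file can be spot-checked before running any evaluation. Here is a minimal sketch (not part of the original workflow) using the `gguf` Python package from the llama.cpp repo (`pip install gguf`), assuming its current `GGUFReader` API, which exposes each tensor's name and quantization type:

```python
from gguf import GGUFReader

reader = GGUFReader("/my/model/quant.gguf")

# Print each tensor's name alongside its GGML quantization type so the
# per-tensor overrides can be verified by eye.
for tensor in reader.tensors:
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```

This also shows at a glance whether a model has a separate `output.weight` tensor at all, which matters for the Llama 3.2 caveat below.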

---

## Commands used for Llama 3.2

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0 ⚠️

> **This quant was not included in the testing because Llama 3.2 3B has no separate output tensor (its output projection is tied to the token embeddings), so the resulting file is the same as a normal Q8_0.**

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf -- 0.0027674345765262842
--------------------------------------------------------------------------------
```
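
The evaluation script itself isn't included here. For reference, below is a minimal sketch of the comparison being reported, under the assumption that "MSD" is the mean squared difference between the baseline and quantized models' output logits for each prompt; the function and variable names are illustrative, not taken from the actual harness:

```python
import numpy as np

def mean_squared_deviation(baseline_logits: np.ndarray,
                           quant_logits: np.ndarray) -> float:
    """MSD between two [n_tokens, n_vocab] logit arrays for one prompt."""
    assert baseline_logits.shape == quant_logits.shape
    # Accumulate in float64 so tiny Q8_0-level deviations aren't lost.
    diff = baseline_logits.astype(np.float64) - quant_logits.astype(np.float64)
    return float(np.mean(diff ** 2))

# Hypothetical usage: one logit array per prompt for each model.
# per_prompt = [mean_squared_deviation(b, q)
#               for b, q in zip(bf16_logits_per_prompt, quant_logits_per_prompt)]
# average_msd = sum(per_prompt) / len(per_prompt)
```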

---

## Commands used for Qwen2.5-14B

---

### Crush token embeddings to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush FFN to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush attention to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

### Crush output tensor to Q2_K, otherwise Q8_0

```bash
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

---

## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf -- 0.0053400141187012196
--------------------------------------------------------------------------------
```

---

## TL;DR

Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):

| Model        | Q2_K     | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0       |
| ------------ | -------- | --------------- | -------------- | --------------- | ----------------- | ---------- |
| Llama 3.2 3B | 1.504    | 0.344           | 4.693          | 1.528           | N/A               | 0.002      |
| Qwen2.5-14B  | 1.393    | 0.016           | 2.160          | 0.606           | 0.921             | 0.005      |
| **Average**  | **1.45** | **0.18**        | **3.43**       | **1.07**        | **0.921**         | **0.0041** |

In short: aggressive quantization of the FFN tensors causes the greatest deviation from BF16, while aggressive quantization of the token embeddings causes the least. Note that deviations greater than roughly 0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
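
As a concrete example of that advice, here is the command template from above instantiated for one such middle-ground mix, `EQ4_K-FQ4_K-AQ8_0-OQ8_0` (the same combination used to illustrate the naming scheme; the paths here are placeholders):

```bash
TYPE_EMBD=Q4_K
TYPE_FFN=Q4_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/my/model/Model-Name-BF16.gguf
DST_GGUF=/my/model/Model-Name-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```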