---
base_model: Qwen/Qwen3-Coder-480B-A35B-Instruct
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/LICENSE
base_model_relation: quantized
tags:
  - qwen3_moe
  - conversational
  - ik_llama.cpp
---

This is an IQ1_M_R4 quant of Qwen3-Coder-480B-A35B-Instruct, using the imatrix from ubergarm, with a slightly modified recipe:

```
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_m_r4

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```
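
For reference, here is a minimal sketch of how such a per-tensor recipe can be applied with ik_llama.cpp's `llama-quantize` and its `--custom-q` option. The paths, file names, and thread count are placeholders, not the exact command used to produce this quant:

```bash
#!/usr/bin/env bash
# Sketch only: paths, file names, and thread count are hypothetical.
# Strip the comment lines and join the recipe into the comma-separated
# form expected by --custom-q.
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_m_r4
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/Qwen3-Coder-480B-A35B-Instruct-BF16.gguf \
    Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    IQ1_M_R4 \
    32
```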

The file size is 112749523904 bytes, or 105.006 GiB.

With the routed experts fully offloaded to CPU, the memory usage (before context) will be:

```
llm_load_tensors:        CPU buffer size = 97863.13 MiB
llm_load_tensors:  CUDA_Host buffer size =   500.77 MiB
llm_load_tensors:      CUDA0 buffer size =  9156.73 MiB
```
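
A minimal launch sketch that produces this kind of split, assuming ik_llama.cpp's `llama-server` with the `-ot`/`--override-tensor` option; the model path, context size, and thread count are placeholders:

```bash
# Sketch only: model path, context size, and thread count are placeholders.
# -ngl 99 offloads all layers to the GPU, then -ot pins the routed-expert
# tensors back onto the CPU, matching the buffer split above.
./build/bin/llama-server \
    --model /path/to/Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -ot "blk\..*\.ffn_.*_exps.*=CPU" \
    --threads 32 \
    --host 127.0.0.1 --port 8080
```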

I have a code-refactoring eval that uses greedy decoding for deterministic output, and this is the smallest quant I tested that still performs the same as the BF16 version (OpenRouter, Alibaba provider).
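
For context, a greedy, deterministic request can be reproduced against llama-server's OpenAI-compatible endpoint by forcing temperature 0 and top_k 1; the host, port, and prompt below are placeholders, not the actual eval harness:

```bash
# Sketch only: host, port, and prompt are placeholders; this just shows
# a deterministic (greedy) request shape, not the eval itself.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "temperature": 0,
        "top_k": 1,
        "messages": [
          {"role": "user", "content": "Refactor the following function ..."}
        ]
      }'
```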

Note that this quant only works with ik_llama.cpp; mainline llama.cpp does not support the IQ1_M_R4, IQ4_K, and IQ6_K tensor types used here.

Big thanks to ubergarm for sharing the imatrix, the recipes, and most importantly the detailed guidance and thought processes.