---
base_model: Qwen/Qwen3-Coder-480B-A35B-Instruct
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/LICENSE
base_model_relation: quantized
tags:
- qwen3_moe
- conversational
- ik_llama.cpp
---

This is an IQ1_M_R4 quant of Qwen3-Coder-480B-A35B-Instruct, using the [imatrix](https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/blob/main/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat) from [ubergarm](https://huggingface.co/ubergarm), with a slightly modified recipe:

```
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_m_r4

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

The file size is 112,749,523,904 bytes (105.006 GiB). With the routed experts fully offloaded to the CPU, memory usage (before context) is:

```
llm_load_tensors: CPU buffer size = 97863.13 MiB
llm_load_tensors: CUDA_Host buffer size = 500.77 MiB
llm_load_tensors: CUDA0 buffer size = 9156.73 MiB
```

I have a code-refactoring eval that uses greedy decoding with deterministic output, and this is the smallest quant I tested that still produces the same results as the BF16 version (OpenRouter, Alibaba provider).

Note that this quant only works with [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/).

Big thanks to ubergarm for sharing the imatrix, the recipes, and most importantly the detailed guidance and thought processes.
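
If you want to reproduce a quant like this, my understanding is that recipes in the format above are passed to ik_llama.cpp's `llama-quantize` as comma-separated `regex=type` pairs via `--custom-q`. The sketch below assumes that interface, that the recipe is saved to a file named `recipe.txt`, and uses placeholder GGUF file names; check your build's `--help` output before relying on it.

```
# Sketch only: assumes ik_llama.cpp's llama-quantize accepts --custom-q with
# comma-separated regex=type overrides; file names are placeholders.
# Drop comment/blank lines from the recipe and join the rest with commas.
custom=$(grep -v '^#' recipe.txt | grep -v '^$' | paste -sd, -)

./build/bin/llama-quantize \
    --imatrix imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    --custom-q "$custom" \
    Qwen3-Coder-480B-A35B-Instruct-BF16.gguf \
    Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    IQ1_M_R4
```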
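
To get a memory split like the one shown above (routed experts on CPU, everything else on a single CUDA device), a typical ik_llama.cpp invocation looks something like the sketch below. The model path, context size, thread count, and port are placeholders, not the settings used for this card, so adjust them to your setup.

```
# Sketch only: placeholder paths and values; verify flags with llama-server --help.
./build/bin/llama-server \
    --model Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --ctx-size 32768 \
    --threads 32 \
    --host 127.0.0.1 --port 8080
```

The `--override-tensor exps=CPU` pattern matches the `ffn_*_exps` tensors (the routed experts) and keeps them in the CPU buffer, while attention, shared-expert, and output tensors stay on the GPU.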