---
base_model: Qwen/Qwen3-Coder-480B-A35B-Instruct
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/LICENSE
base_model_relation: quantized
tags:
- qwen3_moe
- conversational
- ik_llama.cpp
---

This is an IQ1_M_R4 quant of Qwen3-Coder-480B-A35B-Instruct, using the [imatrix](https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/blob/main/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat) from [ubergarm](https://huggingface.co/ubergarm), with a slightly modified recipe:

```
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq1_m_r4
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_m_r4

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

The file size is 112,749,523,904 bytes (105.006 GiB). With the routed experts fully offloaded to the CPU, memory usage (before context) is:

```
llm_load_tensors: CPU buffer size = 97863.13 MiB
llm_load_tensors: CUDA_Host buffer size = 500.77 MiB
llm_load_tensors: CUDA0 buffer size = 9156.73 MiB
```

I have a code-refactoring eval that uses greedy decoding with deterministic output, and this is the smallest quant I tested that still produces the same results as the BF16 version (OpenRouter, Alibaba provider).

Note that this quant only works with [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/).

Big thanks to ubergarm for sharing the imatrix, the recipes, and most importantly the detailed guidance and thought processes.
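
If you want to reproduce a quant like this, my understanding is that recipes in the format above are passed to ik_llama.cpp's `llama-quantize` as comma-separated `regex=type` pairs via `--custom-q`. The sketch below assumes that interface, that the recipe is saved to a file named `recipe.txt`, and uses placeholder GGUF file names; check your build's `--help` output before relying on it.

```
# Sketch only: assumes ik_llama.cpp's llama-quantize accepts --custom-q with
# comma-separated regex=type overrides; file names are placeholders.
# Drop comment/blank lines from the recipe and join the rest with commas.
custom=$(grep -v '^#' recipe.txt | grep -v '^$' | paste -sd, -)

./build/bin/llama-quantize \
    --imatrix imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    --custom-q "$custom" \
    Qwen3-Coder-480B-A35B-Instruct-BF16.gguf \
    Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    IQ1_M_R4
```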
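
To get a memory split like the one shown above (routed experts on CPU, everything else on a single CUDA device), a typical ik_llama.cpp invocation looks something like the sketch below. The model path, context size, thread count, and port are placeholders, not the settings used for this card, so adjust them to your setup.

```
# Sketch only: placeholder paths and values; verify flags with llama-server --help.
./build/bin/llama-server \
    --model Qwen3-Coder-480B-A35B-Instruct-IQ1_M_R4.gguf \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --ctx-size 32768 \
    --threads 32 \
    --host 127.0.0.1 --port 8080
```

The `--override-tensor exps=CPU` pattern matches the `ffn_*_exps` tensors (the routed experts) and keeps them in the CPU buffer, while attention, shared-expert, and output tensors stay on the GPU.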