Kimi-Linear-48B-A3B-Instruct AWQ - INT4

Model Details

Quantization Method: AWQ
Bits: 4
Group Size: 32
Calibration Dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
Quantization Tool: llm-compressor
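
To make the "Bits: 4" and "Group Size: 32" settings concrete, here is a minimal sketch of plain group-wise round-to-nearest quantization. It is not the AWQ algorithm itself (AWQ additionally rescales weights using activation statistics), and it is not llm-compressor's implementation. The function names, the asymmetric min/max scheme, and the 16-bit scale and 4-bit zero-point storage costs are all assumptions chosen for illustration.

```python
import math


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns integer codes in [0, 2**bits - 1] plus the scale and offset
    needed to map them back to floats.
    """
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=32, bits=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def bits_per_weight(group_size=32, bits=4, scale_bits=16, zero_bits=4):
    """Average storage per weight, including per-group metadata.

    scale_bits and zero_bits are illustrative assumptions, not values
    taken from this checkpoint.
    """
    return bits + (scale_bits + zero_bits) / group_size
```

Under these assumptions, `bits_per_weight()` comes to 4.625 bits per quantized weight. A smaller group size means more groups, each with its own scale, which tends to lower quantization error at the cost of extra metadata.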

Inference

Prerequisite

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install -e .

Basic Usage

vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code
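
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions route. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8000` (matching `--port 8000` above) and that no API key is configured; `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the server from the command above runs on this machine.
BASE_URL = "http://localhost:8000/v1"
MODEL = "cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit"


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=256):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be up):
# print(chat("Is 123 a prime?"))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.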

Additional Information

Changelog

v1.0.0 - Initial quantized release

Authors

Name: cyankiwi
Contact: ton@cyan.kiwi

Tech Report | HuggingFace

(a) On MMLU-Pro (4k context length), Kimi Linear scores 51.0 at a speed similar to full attention. On RULER (128k context length), it is Pareto-optimal, scoring 84.3 with a 3.98x speedup. (b) Kimi Linear achieves 6.3x faster TPOT than MLA, offering significant speedups at long sequence lengths (1M tokens).

Overview

Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to make better use of finite-state RNN memory.
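
To illustrate what "finite-state RNN memory" with gating means, here is a minimal pure-Python sketch of one recurrent step of a gated delta rule with per-channel decay. It assumes the commonly written form S_t = (I - beta k k^T) Diag(alpha) S_{t-1} + beta k v^T with output o_t = S_t^T q_t; the exact parameterization used by the released KDA kernel may differ, and `gated_delta_step` is a hypothetical name, not an FLA function.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S     : d_k x d_v memory matrix (list of lists); its size never grows.
    q, k  : query and key vectors of length d_k.
    v     : value vector of length d_v.
    alpha : per-channel decay factors in (0, 1], length d_k (assumed form).
    beta  : scalar write strength in [0, 1].
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Channel-wise forgetting: row i of the memory is scaled by alpha[i].
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed memory currently predicts for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta-rule write: move the stored value for k toward v.
    #    Algebraically equal to (I - beta k k^T) S + beta k v^T.
    S = [
        [S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]
    # 4) Read out with the query.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o
```

The memory `S` has a fixed d_k x d_v shape regardless of how many tokens have been processed, which is why layers of this kind do not need a KV cache that grows with context length.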

Kimi Linear achieves superior performance and hardware efficiency, especially on long-context tasks. It reduces KV cache usage by up to 75% and boosts decoding throughput by up to $6\times$ for contexts as long as 1M tokens.

We open-source the KDA kernel in FLA and release two model checkpoints trained on 5.7T tokens.

Model	#Total Params	#Activated Params	Context Length	Download Link
Kimi-Linear-Base	48B	3B	1M	🤗 Hugging Face
Kimi-Linear-Instruct	48B	3B	1M	🤗 Hugging Face

Key Features

Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with fine-grained gating.
Hybrid Architecture: A 3:1 ratio of KDA layers to global MLA layers reduces memory usage while matching or surpassing the quality of full attention.
Superior Performance: Outperforms full attention on a variety of tasks, including long-context and RL-style benchmarks, in fair comparisons on 1.4T-token training runs.
High Throughput: Achieves up to $6\times$ faster decoding and significantly reduces time per output token (TPOT).
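
The 3:1 layer ratio and the "up to 75%" KV cache figure can be related with a short back-of-the-envelope calculation. The sketch below assumes that only the MLA layers keep a per-token KV cache, that every MLA layer caches the same amount per token, and that the fixed-size state of the KDA layers is negligible at long context. The function names are illustrative.

```python
def kv_cache_fraction(kda_layers, mla_layers):
    """Fraction of layers that still keep a KV cache growing with context."""
    return mla_layers / (kda_layers + mla_layers)


def kv_cache_reduction(kda_layers=3, mla_layers=1):
    """Relative KV cache saving versus a model where every layer caches KV."""
    return 1.0 - kv_cache_fraction(kda_layers, mla_layers)
```

With three KDA layers for every MLA layer, one layer in four keeps a growing cache, so `kv_cache_reduction()` evaluates to 0.75 under these assumptions.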

Usage

Inference with Hugging Face Transformers

To use the Kimi Linear model, we recommend the following environment:

python >= 3.10
torch >= 2.6
fla-core >= 0.4.0

pip install -U fla-core

Example Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets Transformers run the model code shipped with the checkpoint.
model_name = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Build the prompt with the model's chat template and move it to the model's device.
messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
    {"role": "user", "content": "Is 123 a prime?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate up to 500 new tokens; the decoded text contains the prompt followed by the reply.
generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)

Deployment

For deployment, you can use the latest vLLM to create an OpenAI-compatible API endpoint.

vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code

Citation

If you find our work useful, please cite:

@article{kimi2025kda,
  title  = {Kimi Linear: An Expressive, Efficient Attention Architecture},
  author = {Kimi Team},
  year   = {2025},
  url    = {https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf}
}
