Kwaipilot

Highlights

KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks.

On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy ⚡ when evaluated strictly with the SWE-agent scaffold. KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. With this open-source release, we aim to share the technical innovations behind KAT-Coder's large-scale RL with developers and researchers.



KAT-Dev-72B-Exp-AWQ-INT4

Model Details

  • Developed by: Kwaipilot
  • Quantized by: Thomas Whitworth
  • Base model: Kwaipilot/KAT-Dev-72B-Exp
  • Model type: Decoder-only transformer (Qwen2 architecture)
  • Parameters: ~72 billion
  • Quantization method: Activation-Aware Weight Quantization (AWQ)
  • Weight precision: 4-bit integers (INT4)
  • Activation precision: FP16 (W4A16)
  • Format: safetensors with standard AWQ metadata (not compressed_tensors)
  • Max context length: 131,072 tokens (via NTK-RoPE scaling)
  • Intended inference engine: LMDeploy
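
The 131,072-token context comes from NTK-aware RoPE scaling, which stretches the rotary base rather than interpolating positions. A minimal sketch of the idea, assuming a generic scaling rule and illustrative parameter values (the exact base and extension factor used by this checkpoint are not stated here):

```python
import math

def ntk_rope_base(base: float, orig_ctx: int, target_ctx: int, head_dim: int) -> float:
    """NTK-aware RoPE: scale the rotary base so low-frequency dimensions
    stretch to cover the longer context. Illustrative formula only."""
    s = target_ctx / orig_ctx                      # context extension factor
    return base * s ** (head_dim / (head_dim - 2))

def rope_frequencies(base: float, head_dim: int):
    """Standard RoPE inverse frequencies for dimension pairs 0, 2, ..., d-2."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

A larger base lowers the rotation frequencies, so distant positions stay distinguishable without retraining from scratch.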

Model Description

KAT-Dev-72B-Exp-AWQ-INT4 is a 4-bit quantized version of the open-source KAT-Dev-72B-Exp model, which is the experimental reinforcement-learning variant of the proprietary KAT-Coder system. The base model achieves 74.6% pass@1 accuracy on SWE-Bench Verified when evaluated with the official SWE-agent scaffold.

This quantized variant was created to enable efficient, long-context inference on accessible hardware (e.g., 2× RTX 3090/4090) while preserving near-original performance. It leverages AWQ to minimize accuracy loss and is optimized for LMDeploy, which supports:

  • 4-bit AWQ weight dequantization
  • Online 4-bit or 8-bit KV cache quantization
  • Tensor-parallel inference across multiple GPUs

The model retains the base architecture’s innovations, including a rewritten attention kernel and a training engine optimized for shared-prefix RL trajectories. During RL training, advantage signals were reshaped based on pass rates to prevent exploration collapse—amplifying rewards for highly exploratory trajectories while dampening low-diversity ones.
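
The pass-rate-based reshaping described above can be sketched as a group-relative scheme. This is an illustrative toy version with an assumed diversity weighting, not the actual KAT training recipe, which has not been published in detail:

```python
def reshape_advantages(rewards, weight=1.0):
    """Reshape per-trajectory advantages by the group's pass rate.

    Groups with mid-range pass rates (high outcome diversity) get an
    amplified signal; near-0 or near-1 pass rates (low diversity, at
    risk of exploration collapse) get dampened. Illustrative only.
    """
    n = len(rewards)
    pass_rate = sum(r > 0 for r in rewards) / n
    mean = sum(rewards) / n
    # Diversity factor peaks at pass_rate = 0.5 and vanishes at 0 or 1.
    diversity = 4 * pass_rate * (1 - pass_rate)
    scale = 1.0 + weight * (diversity - 0.5)
    return [(r - mean) * scale for r in rewards]
```

With a half-solved group the advantage is scaled up; a group where every rollout succeeds contributes no gradient signal at all.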

Intended Use

This model is intended for:

  • Researchers studying large-scale RL for code generation
  • Engineers building long-context coding assistants or SWE-agents
  • Developers who need a powerful coder model deployable on consumer GPUs

It is not recommended for:

  • Tasks unrelated to software engineering or code reasoning

How to Use & Install LMDeploy

Initialize the project (creates pyproject.toml etc.)

uv init

Pin the project to Python 3.10

uv python pin 3.10

Create the venv (uv will use Python 3.10 because of the pin)

uv venv --python 3.10

Add lmdeploy at version 0.10.1

uv add lmdeploy==0.10.1

Verify the installation

uv run python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
uv run python -c "import lmdeploy; print('lmdeploy imported')"

Sync the environment (ensures the venv matches the lockfile)

uv sync

Inference from the CLI with a 4-bit KV cache:

uv run lmdeploy serve api_server twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct --backend turbomind --model-format awq --tp 4 --quant-policy 4 --enable_prefix_caching
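
Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming LMDeploy's default port 23333 and the served model name from the command above:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# served by `lmdeploy serve api_server` (default port: 23333).
payload = {
    "model": "twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct",
    "messages": [
        {"role": "user", "content": "Refactor this function to add error handling."},
    ],
    "temperature": 0.6,   # matches the SWE-agent evaluation setting below
    "max_tokens": 1024,
}
body = json.dumps(payload)
```

POST `body` to `http://localhost:23333/v1/chat/completions` with `curl` or any OpenAI-compatible client.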

Inference using Python:

from lmdeploy import pipeline, TurbomindEngineConfig

engine = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=4,              # 4-bit KV cache quantization
    session_len=131072,          # full 131,072-token context
    tp=4,                        # tensor parallelism across 4 GPUs
    enable_prefix_caching=True
)

pipe = pipeline("Kwaipilot/KAT-Dev-72B-Exp-AWQ-INT4", backend_config=engine)
response = pipe("Write a Python function to parse JSON with error handling.")
print(response.text)

For full deployment examples (SWE-agent, long-code analysis, API server), see the README.

Performance Evaluation

SWE-Bench Verified

  • Base model (FP16): 74.6%
  • This model (AWQ-INT4): ~73.2% (estimated, <2% drop)

Other Benchmarks (estimated from Qwen2-72B & InternLM2 studies)

| Benchmark | Base (FP16) | AWQ-INT4 | Δ |
|-----------|-------------|----------|------|
| MMLU      | 81.3        | 80.4     | -0.9 |
| GSM8K     | 70.1        | 66.9     | -3.2 |
| HumanEval | 68.2        | 66.5     | -1.7 |

Degradation is minimal for code-related tasks and within acceptable bounds for most engineering applications.

Hardware & Memory Requirements

| Component        | FP16    | INT8    | INT4 (this model) |
|------------------|---------|---------|-------------------|
| Weights          | ~144 GB | ~72 GB  | ~36 GB            |
| KV cache @ 32k   | ~80 GB  | ~40 GB  | ~20 GB            |
| Total VRAM @ 32k | >220 GB | ~112 GB | ~56 GB            |
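
The weight rows follow directly from parameter count times bits per weight; the KV-cache rows evidently assume batched serving, since a single 32k-token sequence needs far less with grouped-query attention. A back-of-the-envelope sketch, assuming Qwen2-72B-like dimensions (80 layers, 8 KV heads, head dim 128; these are assumptions, not published figures for this checkpoint):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB: parameter count * bits per weight / 8."""
    return params_billion * 1e9 * bits / 8 / 1e9

def kv_cache_gb_per_seq(layers: int, kv_heads: int, head_dim: int,
                        seq_len: int, bits: int) -> float:
    """Per-sequence KV cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8 / 1e9

print(weight_vram_gb(72, 16))  # 144.0 GB -> FP16 row above
print(weight_vram_gb(72, 4))   # 36.0 GB  -> INT4 row above
# One 32k sequence with an FP16 KV cache, assumed Qwen2-72B dims:
print(kv_cache_gb_per_seq(80, 8, 128, 32768, 16))  # ~10.7 GB
```

Halving the KV-cache bit width (quant_policy=4 vs FP16) scales the cache term linearly, which is what makes 131k-token sessions feasible on 2× 24 GB GPUs plus tensor parallelism.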


Evaluation Settings (SWE-Agent)

The 74.6% SWE-Bench score was achieved using:

temperature: 0.6
max_turns: 150
history_processors.n: 100

License

  • Model weights: Apache License 2.0
  • Quantization: Research use permitted; commercial use requires verification
  • Base model: Same as Kwaipilot/KAT-Dev-72B-Exp (Apache 2.0)

Citation

If you use this model in your research, please cite the original KAT-Dev-72B-Exp:

@misc{kat2025dev72b,
  author = {Kwaipilot},
  title = {KAT-Dev-72B-Exp: A 72B Reinforcement-Learned Coder Model},
  year = {2025},
  howpublished = {\url{https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp}}
}

Acknowledgements

  • Built on the Qwen2 architecture
  • Quantized using PyTorch and LMDeploy
  • Optimized for inference with LMDeploy TurboMind
  • Inspired by SWE-Bench