## Highlights
KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks.
On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy when evaluated strictly with the SWE-agent scaffold. KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. Through this open-source release, we aim to reveal the technical innovations behind KAT-Coder's large-scale RL to developers and researchers.
# KAT-Dev-72B-Exp-AWQ-INT4
## Model Details

- Developed by: Kwaipilot
- Quantized by: Thomas Whitworth
- Base model: Kwaipilot/KAT-Dev-72B-Exp
- Model type: Decoder-only transformer (Qwen2 architecture)
- Parameters: ~72 billion
- Quantization method: Activation-Aware Weight Quantization (AWQ)
- Weight precision: 4-bit integer (INT4)
- Activation precision: FP16 (W4A16)
- Format: `safetensors` with standard AWQ metadata (not `compressed_tensors`)
- Max context length: 131,072 tokens (via NTK-RoPE scaling)
- Intended inference engine: LMDeploy
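The W4A16 scheme listed above can be illustrated with a minimal per-group INT4 quantization sketch. This is a simplification of AWQ, which additionally applies activation-aware per-channel scaling before quantizing; the group size of 128 here is an assumption for illustration:

```python
import numpy as np

def quantize_int4_groups(w: np.ndarray, group_size: int = 128):
    """Asymmetric per-group INT4 quantization of a flat weight array."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0              # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)    # guard all-constant groups
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    """Recover approximate FP weights; activations stay FP16 (W4A16)."""
    return q.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s, z = quantize_int4_groups(w)
w_hat = dequantize(q, s, z).reshape(-1)
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

The per-group min/max keeps the quantization error bounded by half a scale step within each group, which is why the accuracy drop reported below stays small.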
## Model Description
KAT-Dev-72B-Exp-AWQ-INT4 is a 4-bit quantized version of the open-source KAT-Dev-72B-Exp model, which is the experimental reinforcement-learning variant of the proprietary KAT-Coder system. The base model achieves 74.6% pass@1 accuracy on SWE-Bench Verified when evaluated with the official SWE-agent scaffold.
This quantized variant was created to enable efficient, long-context inference on accessible hardware (e.g., 2× RTX 3090/4090) while preserving near-original performance. It leverages AWQ to minimize accuracy loss and is optimized for LMDeploy, which supports:
- 4-bit AWQ weight dequantization
- Online 4-bit or 8-bit KV cache quantization
- Tensor-parallel inference across multiple GPUs
The model retains the base architecture’s innovations, including a rewritten attention kernel and a training engine optimized for shared-prefix RL trajectories. During RL training, advantage signals were reshaped based on pass rates to prevent exploration collapse—amplifying rewards for highly exploratory trajectories while dampening low-diversity ones.
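The pass-rate-based advantage reshaping described above might look roughly like the following. This is a hypothetical sketch: the actual shaping function, thresholds, and scaling factors used in KAT-Dev-72B-Exp's RL training are not published.

```python
def reshape_advantages(advantages, pass_rate, low=0.2, high=0.8):
    """Rescale RL advantages by a trajectory group's pass rate.

    Hypothetical thresholds/factors: a low pass rate means the group is
    highly exploratory, so its advantage signal is amplified; a high pass
    rate means low diversity, so the signal is dampened.
    """
    if pass_rate < low:        # rare successes -> boost exploration signal
        factor = 1.5
    elif pass_rate > high:     # saturated group -> damp redundant signal
        factor = 0.5
    else:
        factor = 1.0
    return [a * factor for a in advantages]

# Example: a group of 8 trajectories where only 1 passed (pass rate 0.125)
advs = [0.9, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]
boosted = reshape_advantages(advs, pass_rate=1 / 8)
```

The effect is that gradient updates from near-impossible tasks are not drowned out by updates from tasks the policy already solves reliably.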
## Intended Use
This model is intended for:
- Researchers studying large-scale RL for code generation
- Engineers building long-context coding assistants or SWE-agents
- Developers who need a powerful coder model deployable on consumer GPUs
It is not recommended for:
- Tasks unrelated to software engineering or code reasoning
## How to Use

### Install LMDeploy

```shell
# Initialize the project (creates pyproject.toml etc.)
uv init

# Pin the project to Python 3.10
uv python pin 3.10

# Create the venv (uv uses Python 3.10 because of the pin)
uv venv --python 3.10

# Add lmdeploy at version 0.10.1
uv add lmdeploy==0.10.1

# Verify the installation
uv run python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
uv run python -c "import lmdeploy; print('lmdeploy imported')"

# Sync the environment (ensures the venv matches the lockfile)
uv sync
```
### Serving from the CLI with a 4-bit KV cache

```shell
uv run lmdeploy serve api_server twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct \
    --backend turbomind \
    --model-format awq \
    --tp 4 \
    --quant-policy 4 \
    --enable-prefix-caching
```
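Once the server is up, it exposes an OpenAI-compatible endpoint. A minimal client sketch using only the standard library (the port below is LMDeploy's default, 23333; adjust if you passed `--server-port`):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str,
                       base_url: str = "http://localhost:23333"):
    """Build an OpenAI-style chat completion request for the LMDeploy server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # matches the SWE-agent evaluation setting
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return req, payload

if __name__ == "__main__":
    req, _ = build_chat_request(
        "Write a Python function to parse JSON with error handling.",
        model="twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```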
### Inference with Python

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=4,             # 4-bit KV cache
    session_len=131072,         # max context length
    tp=4,                       # tensor parallelism across 4 GPUs
    enable_prefix_caching=True,
)

pipe = pipeline("twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct", backend_config=engine)
response = pipe("Write a Python function to parse JSON with error handling.")
print(response.text)
```
For full deployment examples (SWE-agent, long-code analysis, API server), see the README.
## Performance Evaluation

### SWE-Bench Verified
- Base model (FP16): 74.6%
- This model (AWQ-INT4): ~73.2% (estimated, <2% drop)
### Other Benchmarks (estimated from Qwen2-72B & InternLM2 studies)
| Benchmark | Base (FP16) | AWQ-INT4 | Δ |
|---|---|---|---|
| MMLU | 81.3 | 80.4 | -0.9 |
| GSM8K | 70.1 | 66.9 | -3.2 |
| HumanEval | 68.2% | 66.5% | -1.7 |
Degradation is minimal for code-related tasks and within acceptable bounds for most engineering applications.
## Hardware & Memory Requirements
| Component | FP16 | INT8 | INT4 (this model) |
|---|---|---|---|
| Weights | ~144 GB | ~72 GB | ~36 GB |
| KV Cache @ 32k | ~80 GB | ~40 GB | ~20 GB |
| Total VRAM @ 32k | >220 GB | ~112 GB | ~56 GB |
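The table's ballpark figures can be reproduced with simple arithmetic, assuming 72B parameters and, for the KV cache, 80 layers and 64 attention heads of dimension 128 with full multi-head KV (i.e. no grouped-query savings; these architectural numbers are assumptions for illustration):

```python
def weight_bytes(params: float, bits: int) -> float:
    """Memory for model weights at a given precision."""
    return params * bits / 8

def kv_cache_bytes(tokens: int, layers: int = 80, heads: int = 64,
                   head_dim: int = 128, kv_bits: int = 16) -> float:
    """K and V tensors, per layer, per head, per token."""
    return 2 * layers * heads * head_dim * (kv_bits / 8) * tokens

GB = 1024**3
print(f"FP16 weights: {weight_bytes(72e9, 16) / 1e9:.0f} GB")            # ~144 GB
print(f"INT4 weights: {weight_bytes(72e9, 4) / 1e9:.0f} GB")             # ~36 GB
print(f"FP16 KV @32k: {kv_cache_bytes(32768) / GB:.0f} GiB")             # ~80 GiB
print(f"INT4 KV @32k: {kv_cache_bytes(32768, kv_bits=4) / GB:.0f} GiB")  # ~20 GiB
```

Totals also need headroom for activations and the runtime, which is why the table's totals exceed weights plus KV cache.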
## Evaluation Settings (SWE-agent)

The 74.6% SWE-Bench Verified score was achieved using:

```yaml
temperature: 0.6
max_turns: 150
history_processors.n: 100
```
## License
- Model weights: Apache License 2.0
- Quantization: Research use permitted; commercial use requires verification
- Base model: Same as Kwaipilot/KAT-Dev-72B-Exp (Apache 2.0)
## Citation

If you use this model in your research, please cite the original KAT-Dev-72B-Exp:

```bibtex
@misc{kat2025dev72b,
  author       = {Kwaipilot},
  title        = {KAT-Dev-72B-Exp: A 72B Reinforcement-Learned Coder Model},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp}}
}
```
## Acknowledgements

- Built on the Qwen2 architecture
- Quantized using PyTorch and LMDeploy
- Optimized for inference with LMDeploy TurboMind
- Inspired by SWE-Bench