Kwaipilot

Highlights

KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks.

On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy ⚡ when evaluated strictly with the SWE-agent scaffold. KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. With this open-source release, we aim to share the technical innovations behind KAT-Coder's large-scale RL with developers and researchers.



KAT-Dev-72B-Exp-AWQ-INT4

Model Details

  • Developed by: Kwaipilot
  • Quantized by: Thomas Whitworth
  • Base model: Kwaipilot/KAT-Dev-72B-Exp
  • Model type: Decoder-only transformer (Qwen2 architecture)
  • Parameters: ~72 billion
  • Quantization method: Activation-Aware Weight Quantization (AWQ)
  • Weight precision: 4-bit integers (INT4)
  • Activation precision: FP16 (W4A16)
  • Format: safetensors with standard AWQ metadata (not compressed_tensors)
  • Max context length: 131,072 tokens (via NTK-RoPE scaling)
  • Intended inference engine: LMDeploy
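
The 131,072-token context comes from NTK-aware RoPE scaling, which stretches the rotary base rather than interpolating positions. A minimal sketch of the idea, assuming a generic scaling rule and illustrative parameter values (the exact base and extension factor used by this checkpoint are not stated here):

```python
import math

def ntk_rope_base(base: float, orig_ctx: int, target_ctx: int, head_dim: int) -> float:
    """NTK-aware RoPE: scale the rotary base so low-frequency dimensions
    stretch to cover the longer context. Illustrative formula only."""
    s = target_ctx / orig_ctx                      # context extension factor
    return base * s ** (head_dim / (head_dim - 2))

def rope_frequencies(base: float, head_dim: int):
    """Standard RoPE inverse frequencies for dimension pairs 0, 2, ..., d-2."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

A larger base lowers the rotation frequencies, so distant positions stay distinguishable without retraining from scratch.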

Model Description

KAT-Dev-72B-Exp-AWQ-INT4 is a 4-bit quantized version of the open-source KAT-Dev-72B-Exp model, which is the experimental reinforcement-learning variant of the proprietary KAT-Coder system. The base model achieves 74.6% pass@1 accuracy on SWE-Bench Verified when evaluated with the official SWE-agent scaffold.

This quantized variant was created to enable efficient, long-context inference on accessible hardware (e.g., 2× RTX 3090/4090) while preserving near-original performance. It leverages AWQ to minimize accuracy loss and is optimized for LMDeploy, which supports:

  • 4-bit AWQ weight dequantization
  • Online 4-bit or 8-bit KV cache quantization
  • Tensor-parallel inference across multiple GPUs

The model retains the base architecture’s innovations, including a rewritten attention kernel and a training engine optimized for shared-prefix RL trajectories. During RL training, advantage signals were reshaped based on pass rates to prevent exploration collapse—amplifying rewards for highly exploratory trajectories while dampening low-diversity ones.
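
The pass-rate-based reshaping described above can be sketched as a group-relative scheme. This is an illustrative toy version with an assumed diversity weighting, not the actual KAT training recipe, which has not been published in detail:

```python
def reshape_advantages(rewards, weight=1.0):
    """Reshape per-trajectory advantages by the group's pass rate.

    Groups with mid-range pass rates (high outcome diversity) get an
    amplified signal; near-0 or near-1 pass rates (low diversity, at
    risk of exploration collapse) get dampened. Illustrative only.
    """
    n = len(rewards)
    pass_rate = sum(r > 0 for r in rewards) / n
    mean = sum(rewards) / n
    # Diversity factor peaks at pass_rate = 0.5 and vanishes at 0 or 1.
    diversity = 4 * pass_rate * (1 - pass_rate)
    scale = 1.0 + weight * (diversity - 0.5)
    return [(r - mean) * scale for r in rewards]
```

With a half-solved group the advantage is scaled up; a group where every rollout succeeds contributes no gradient signal at all.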

Intended Use

This model is intended for:

  • Researchers studying large-scale RL for code generation
  • Engineers building long-context coding assistants or SWE-agents
  • Developers who need a powerful coder model deployable on consumer GPUs

It is not recommended for:

  • Tasks unrelated to software engineering or code reasoning

How to Use & Install LMDeploy

Initialize the project (creates pyproject.toml etc.)

uv init

Pin the project to Python 3.10

uv python pin 3.10

Create the venv (uv will use Python 3.10 because of the pin)

uv venv --python 3.10

Add lmdeploy at version 0.10.1

uv add lmdeploy==0.10.1

Verify the installation

uv run python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
uv run python -c "import lmdeploy; print('lmdeploy imported')"

Sync the environment (ensures the venv matches the lockfile)

uv sync

Inference from the CLI with a 4-bit KV cache:

uv run lmdeploy serve api_server twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct --backend turbomind --model-format awq --tp 4 --quant-policy 4 --enable_prefix_caching
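
Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming LMDeploy's default port 23333 and the served model name from the command above:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# served by `lmdeploy serve api_server` (default port: 23333).
payload = {
    "model": "twhitworth/KAT-Dev-72B-Exp-AWQ-INT4-noct",
    "messages": [
        {"role": "user", "content": "Refactor this function to add error handling."},
    ],
    "temperature": 0.6,   # matches the SWE-agent evaluation setting below
    "max_tokens": 1024,
}
body = json.dumps(payload)
```

POST `body` to `http://localhost:23333/v1/chat/completions` with `curl` or any OpenAI-compatible client.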

Inference using Python:

from lmdeploy import pipeline, TurbomindEngineConfig

engine = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=4,              # 4-bit KV cache quantization
    session_len=131072,          # full 131,072-token context
    tp=4,                        # tensor parallelism across 4 GPUs
    enable_prefix_caching=True
)

pipe = pipeline("Kwaipilot/KAT-Dev-72B-Exp-AWQ-INT4", backend_config=engine)
response = pipe("Write a Python function to parse JSON with error handling.")
print(response.text)

For full deployment examples (SWE-agent, long-code analysis, API server), see the README.

Performance Evaluation

SWE-Bench Verified

  • Base model (FP16): 74.6%
  • This model (AWQ-INT4): ~73.2% (estimated, <2% drop)

Other Benchmarks (estimated from Qwen2-72B & InternLM2 studies)

| Benchmark | Base (FP16) | AWQ-INT4 | Δ |
|-----------|-------------|----------|------|
| MMLU      | 81.3        | 80.4     | -0.9 |
| GSM8K     | 70.1        | 66.9     | -3.2 |
| HumanEval | 68.2        | 66.5     | -1.7 |

Degradation is minimal for code-related tasks and within acceptable bounds for most engineering applications.

Hardware & Memory Requirements

| Component        | FP16    | INT8    | INT4 (this model) |
|------------------|---------|---------|-------------------|
| Weights          | ~144 GB | ~72 GB  | ~36 GB            |
| KV cache @ 32k   | ~80 GB  | ~40 GB  | ~20 GB            |
| Total VRAM @ 32k | >220 GB | ~112 GB | ~56 GB            |
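
The weight rows follow directly from parameter count times bits per weight; the KV-cache rows evidently assume batched serving, since a single 32k-token sequence needs far less with grouped-query attention. A back-of-the-envelope sketch, assuming Qwen2-72B-like dimensions (80 layers, 8 KV heads, head dim 128; these are assumptions, not published figures for this checkpoint):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB: parameter count * bits per weight / 8."""
    return params_billion * 1e9 * bits / 8 / 1e9

def kv_cache_gb_per_seq(layers: int, kv_heads: int, head_dim: int,
                        seq_len: int, bits: int) -> float:
    """Per-sequence KV cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8 / 1e9

print(weight_vram_gb(72, 16))  # 144.0 GB -> FP16 row above
print(weight_vram_gb(72, 4))   # 36.0 GB  -> INT4 row above
# One 32k sequence with an FP16 KV cache, assumed Qwen2-72B dims:
print(kv_cache_gb_per_seq(80, 8, 128, 32768, 16))  # ~10.7 GB
```

Halving the KV-cache bit width (quant_policy=4 vs FP16) scales the cache term linearly, which is what makes 131k-token sessions feasible on 2× 24 GB GPUs plus tensor parallelism.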


Evaluation Settings (SWE-Agent)

The 74.6% SWE-Bench score was achieved using:

temperature: 0.6
max_turns: 150
history_processors.n: 100

License

  • Model weights: Apache License 2.0
  • Quantization: Research use permitted; commercial use requires verification
  • Base model: Same as Kwaipilot/KAT-Dev-72B-Exp (Apache 2.0)

Citation

If you use this model in your research, please cite the original KAT-Dev-72B-Exp:

@misc{kat2025dev72b,
  author = {Kwaipilot},
  title = {KAT-Dev-72B-Exp: A 72B Reinforcement-Learned Coder Model},
  year = {2025},
  howpublished = {\url{https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp}}
}

Acknowledgements

  • Built on the Qwen2 architecture
  • Quantized using PyTorch and LMDeploy
  • Optimized for inference with LMDeploy TurboMind
  • Inspired by SWE-Bench