---
library_name: transformers
license: mit
base_model: zai-org/GLM-4.6
tags:
- text-generation
- conversational
- awq
- quantized
- 4-bit
- vllm
- moe
- mixture-of-experts
- glm
- zhipu
language:
- en
- zh
pipeline_tag: text-generation
model_type: glm
quantization: awq
inference: false
datasets:
- neuralmagic/LLM_compression_calibration
---
# GLM-4.6-AWQ - Optimized 4-bit Quantization for Production Deployment
**High-performance AWQ quantization of ZHIPU AI's GLM-4.6 (357B MoE) optimized for vLLM inference**
[Base model: GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) · [vLLM](https://github.com/vllm-project/vllm) · [AWQ (llm-awq)](https://github.com/mit-han-lab/llm-awq) · [This repository](https://huggingface.co/bullpoint/GLM-4.6-AWQ)
## Model Overview
This is a **professionally quantized 4-bit AWQ version** of [Z.ai's GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) optimized for high-throughput production deployment with vLLM.
- **Base Model**: [GLM-4.6](https://huggingface.co/zai-org/GLM-4.6) (357B parameters, 160 experts MoE)
- **Model Size**: 176 GB (39 safetensors files)
- **License**: MIT (inherited from base model)
- **Quantization**: AWQ 4-bit with group size 128
- **Active Parameters**: 28.72B per token (8 of 160 experts)
- **Quantization Framework**: llmcompressor 0.8.1.dev0
- **Optimization**: Marlin kernels for NVIDIA GPUs
- **Context Length**: Up to 200K tokens (131K recommended for optimal performance)
- **Languages**: English, Chinese
## Performance Benchmarks
Tested on **4× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB each, 384GB total VRAM)**:
| Configuration | Throughput | VRAM/GPU | Total VRAM | Use Case |
|--------------|------------|----------|------------|----------|
| **With Expert Parallelism** | **~60 tok/s** | **~47GB** | **~188GB** | **Recommended: Multi-model deployment** |
| Without Expert Parallelism | ~65 tok/s | ~95GB | ~384GB | Single model, maximum speed |
### Performance Characteristics
- **Memory Bandwidth Efficiency**: 50.3% (excellent for MoE models)
- **Theoretical Maximum**: 130 tok/s (memory bandwidth bound)
- **Aggregate Bandwidth**: 1.7 TB/s effective (4× RTX PRO 6000 Blackwell Max-Q)
- **Actual vs Theoretical**: the gap is typical for sparse MoE architectures
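As a rough sanity check on the numbers above, the bandwidth-bound decode ceiling can be estimated from the active parameter count and the 4-bit weight size. A back-of-envelope sketch (all inputs taken from this card; the result is approximate and ignores KV-cache and activation traffic):
```python
# Back-of-envelope estimate of the memory-bandwidth-bound decode ceiling.
# All numbers come from this model card; real kernels add overhead,
# so treat the result as a rough upper bound, not a measurement.
active_params = 28.72e9        # active parameters per token (8 of 160 experts)
bytes_per_weight = 0.5         # 4-bit AWQ weights ~ 0.5 bytes per parameter
effective_bandwidth = 1.7e12   # ~1.7 TB/s effective aggregate (4x RTX PRO 6000 Blackwell Max-Q)

bytes_per_token = active_params * bytes_per_weight      # ~14.4 GB of weights read per token
ceiling_tok_s = effective_bandwidth / bytes_per_token   # ~118 tok/s, same ballpark as the ~130 tok/s above

print(f"~{bytes_per_token / 1e9:.1f} GB read per token, ceiling ~{ceiling_tok_s:.0f} tok/s")
```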
### Why AWQ Over Other Quantizations?
| Method | Accuracy | Speed | Disk Size | VRAM | Status |
|--------|----------|-------|-----------|------|--------|
| **AWQ 4-bit** | **Best** (indistinguishable from BF16) | **Fast** (Marlin kernels) | **176GB** | **188GB** | ✅ **This model** |
| GPTQ 4-bit | Lower (2× MMLU drop vs AWQ) | Similar | ~180GB | ~188GB | ⚠️ Overfits calibration data |
| FP8 | Higher precision | 3.5× slower | ~330GB | ~330GB | ❌ Unoptimized kernels |
| BF16 | Highest | N/A | ~714GB | 800GB+ | ❌ Too large for most setups |
**Research shows**: AWQ has ~1 point MMLU drop while GPTQ has ~2 points. AWQ performance is indistinguishable from full BF16 on real-world benchmarks.
## VRAM Requirements
### Minimum Requirements (Expert Parallelism)
- **Model Download Size**: 176 GB
- **4× GPUs** with **48GB+ VRAM each** (192GB total minimum)
- **Recommended**: 4× 80GB GPUs or 4× 96GB GPUs
- **Memory Type**: HBM2e/HBM3/HBM3e for best performance
- **Disk Space**: 180+ GB for model storage
### Supported Configurations
| Setup | GPUs | VRAM/GPU | Total VRAM | Disk | Performance |
|-------|------|----------|------------|------|-------------|
| **Tested** | **4× RTX PRO 6000 Blackwell Max-Q (96GB)** | **~47GB** | **384GB** | **176GB** | **~60 tok/s** |
| Optimal | 4× H100 (80GB) | ~47GB | 320GB | 176GB | ~75-80 tok/s |
| Budget | 4× A100 (80GB) | ~47GB | 320GB | 176GB | ~50-55 tok/s |
| High-Speed | 2× H200 NVL | ~95GB | 192GB | 176GB | ~100+ tok/s |
## Installation & Usage
### Prerequisites
```bash
pip install "vllm>=0.11.0"
# Or install from source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm && pip install -e .
```
### Quick Start with vLLM
**Recommended Configuration (Expert Parallelism for Multi-Model Deployment):**
```bash
vllm serve <model_path> \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.6-awq \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--port 8000
```
**Maximum Speed Configuration (Single Model):**
```bash
vllm serve <model_path> \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.6-awq \
--max-model-len 131072 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--port 8000
```
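Once either server is up, the standard OpenAI-compatible endpoints that vLLM exposes make for a quick sanity check (port and served model name taken from the commands above):
```bash
# List the served models; should return "glm-4.6-awq"
curl http://localhost:8000/v1/models

# Minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "glm-4.6-awq",
        "messages": [{"role": "user", "content": "Hello /nothink"}],
        "max_tokens": 64
      }'
```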
### Python API Usage
```python
from vllm import LLM, SamplingParams
# Initialize with expert parallelism (saves VRAM)
llm = LLM(
    model="path/to/GLM-4.6-AWQ",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    max_model_len=131072,
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

# Disable reasoning overhead for maximum speed
prompts = [
    "Explain quantum computing in simple terms. /nothink",
    "Write a Python function to calculate Fibonacci numbers. /nothink"
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=400
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
### OpenAI-Compatible API
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM doesn't require an API key unless one is configured
)

response = client.chat.completions.create(
    model="glm-4.6-awq",
    messages=[
        {"role": "user", "content": "Explain quantum computing /nothink"}
    ],
    max_tokens=400,
    temperature=0.7
)
print(response.choices[0].message.content)
```
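For interactive use, the same endpoint also supports streaming through the OpenAI SDK's standard interface; a minimal sketch (prompt text is illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Stream tokens as they are generated instead of waiting for the full response
stream = client.chat.completions.create(
    model="glm-4.6-awq",
    messages=[{"role": "user", "content": "Summarize the benefits of AWQ quantization /nothink"}],
    max_tokens=400,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```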
## Quantization Details
### Technical Specifications
- **Method**: Activation-Aware Weight Quantization (AWQ)
- **Precision**: 4-bit signed integers
- **Group Size**: 128 (optimal balance of speed/accuracy)
- **Calibration Dataset**: [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) (512 samples)
- **Format**: Compressed-tensors with Marlin kernel support
- **Kernel**: MarlinLinearKernel + CompressedTensorsWNA16MarlinMoEMethod
### What Was Quantized?
- ✅ All 92 transformer decoder layers (layers 0-91)
- ✅ All 160 experts per layer (MoE experts)
- ✅ Attention projections (Q, K, V, O)
- ✅ MLP projections (gate, up, down)
- ❌ LM head (kept at full precision for output quality)
- ❌ MTP layer 92 (removed; incompatible with 4-bit quantization)
**Note on MTP (Multi-Token Prediction)**: The original GLM-4.6 includes a speculative decoding layer (layer 92) for drafting multiple tokens. This layer has been **intentionally removed** from this quantization because:
1. **4-bit precision is insufficient** for MTP to achieve acceptable draft token acceptance rates (0% acceptance observed)
2. **Adds 1.92GB VRAM** without providing speedup benefits
3. Research shows 8-bit or FP16 precision is required for effective MTP
### Quantization Process
This model was quantized using the following configuration:
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from datasets import load_dataset
from transformers import AutoModelForCausalLM

# Load the BF16 base model; device_map="sequential" allows CPU offloading
# (see the hardware notes in the Reproduction section below)
model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.6",
    torch_dtype="auto",
    device_map="sequential",
)

# Load calibration data from Neural Magic's curated dataset
dataset = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
dataset = dataset.shuffle(seed=42).select(range(512))
# Define ignore patterns and targets
ignore_patterns = [
    "lm_head",
    "model.embed_tokens",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "model.norm",
    "re:.*q_norm$",
    "re:.*k_norm$",
    "re:.*shared_experts.*",
    "re:.*mlp\\.gate\\.weight$",
    "re:.*mlp\\.gate\\..*bias$",
    "re:model.layers.[0-2]\\.",
]
targets = [
    "re:.*gate_proj.*",
    "re:.*up_proj.*",
    "re:.*down_proj.*",
    "re:.*k_proj.*",
    "re:.*q_proj.*",
    "re:.*v_proj.*",
    "re:.*o_proj.*",
]
# AWQ quantization recipe
recipe = [
    AWQModifier(
        ignore=ignore_patterns,
        config_groups={
            "group_0": {
                "targets": targets,
                "weights": {
                    "num_bits": 4,
                    "type": "int",
                    "symmetric": True,
                    "group_size": 128,
                    "strategy": "group",
                    "dynamic": False,
                },
                "input_activations": None,
                "output_activations": None,
                "format": None,
            }
        },
    )
]
# Apply quantization
oneshot(
    model=model,  # BF16 model loaded above
    dataset=dataset,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512
)
```
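After `oneshot` completes, the quantized weights still have to be written out in the compressed-tensors format vLLM loads. A minimal sketch of that step, following the save pattern used in llmcompressor's published examples (the output directory name is illustrative):
```python
from transformers import AutoTokenizer

SAVE_DIR = "GLM-4.6-AWQ"  # illustrative output path

# llmcompressor extends save_pretrained; save_compressed=True writes
# compressed-tensors safetensors shards plus the quantization config
model.save_pretrained(SAVE_DIR, save_compressed=True)

# Ship the tokenizer alongside the weights
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.6")
tokenizer.save_pretrained(SAVE_DIR)
```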
## Performance Optimization Tips
### 1. Use `/nothink` for Maximum Speed
GLM-4.6 includes a reasoning mode that adds thinking overhead. Disable it for ~9% speedup:
```python
# Add /nothink to your prompts
prompt = "Your question here /nothink"
```
### 2. Enable Expert Parallelism
Distribute experts across GPUs to save VRAM for multi-model serving:
```bash
--enable-expert-parallel # Saves ~50GB total VRAM across 4 GPUs
```
### 3. Optimize Context Length
Longer context = more KV cache memory:
```bash
--max-model-len 131072 # Recommended (vs default 202752)
```
### 4. Tune Concurrent Requests
```bash
--max-num-seqs 1 # Minimum KV cache (single request at max context)
--max-num-seqs 64 # Higher throughput (multiple concurrent requests)
```
### 5. Monitor Memory Bandwidth
This model is **memory bandwidth bound**. Faster GPUs see proportional speedups:
- H100 (3.35 TB/s): ~120 tok/s
- H200 NVL (4.8 TB/s): ~165 tok/s
- RTX PRO 6000 Blackwell Max-Q (1.75 TB/s): ~60 tok/s
## Use Cases
### Recommended Applications
- ✅ **Production Chatbots**: Fast, accurate responses with minimal VRAM
- ✅ **Multi-Model Serving**: Expert parallelism enables running multiple models
- ✅ **Code Generation**: High accuracy maintained vs full precision
- ✅ **Reasoning Tasks**: Use default mode (without `/nothink`)
- ✅ **Long Context**: Supports up to 202K tokens
### Not Recommended For
- ❌ **Speculative Decoding**: MTP layer removed (requires 8-bit+ precision)
- ❌ **Extreme Precision Tasks**: Use FP8 or BF16 if accuracy is critical
- ❌ **Single GPU Deployment**: Requires 4× GPUs minimum
## Accuracy Benchmarks
AWQ quantization maintains excellent quality:
| Metric | BF16 Baseline | This AWQ 4-bit | GPTQ 4-bit | Difference |
|--------|---------------|----------------|------------|------------|
| MMLU | 100.0% | ~99.0% | ~98.0% | AWQ: -1%, GPTQ: -2% |
| Perplexity | Baseline | +2-3% | +5-8% | AWQ significantly better |
| Real Tasks | 100.0% | ~100.0% | 95-97% | AWQ indistinguishable |
**Key Finding**: Research shows AWQ performs indistinguishably from BF16 on real-world benchmarks, while GPTQ shows measurable degradation due to overfitting on calibration data.
## Technical Deep Dive
### Architecture
- **Type**: Mixture of Experts (MoE) Transformer
- **Total Parameters**: 357B (base model specification)
- **Experts**: 160 routed experts per layer
- **Active Experts**: 8 per token (5% utilization)
- **Layers**: 92 decoder layers
- **Heads**: 96 attention heads (8 KV heads)
- **Hidden Size**: 5120
- **Intermediate Size**: 12288 (dense), 1536 (MoE)
- **Vocabulary**: 151,552 tokens
- **Context Window**: 200K tokens (original spec)
### Memory Layout
| Component | Per GPU (EP) | Total (4 GPUs) | Percentage |
|-----------|--------------|----------------|------------|
| Model Weights | ~12GB | ~48GB | 25% |
| Expert Weights | ~28GB | ~112GB | 60% |
| KV Cache | ~5GB | ~20GB | 11% |
| Activation | ~2GB | ~8GB | 4% |
| **Total** | **~47GB** | **~188GB** | **100%** |
### Why Marlin Kernels?
Marlin is the state-of-the-art kernel for 4-bit quantized inference:
- **Speed**: 2-3× faster than CUDA native 4-bit
- **Efficiency**: Optimized for Ampere/Ada/Hopper/Blackwell architectures
- **Features**: Fused dequantization + GEMM operations
- **Support**: Integrated into vLLM for production use
## Comparison to Other Models
| Model | Parameters | Disk Size | Quantization | Speed | VRAM | Accuracy |
|-------|------------|-----------|--------------|-------|------|----------|
| **GLM-4.6-AWQ** (this) | 357B | **176GB** | AWQ 4-bit | 60 tok/s | 188GB | Excellent |
| GLM-4.6-GPTQ | 357B | ~180GB | GPTQ 4-bit | 60 tok/s | 188GB | Good |
| GLM-4.6-FP8 | 357B | ~330GB | FP8 | 19 tok/s | 330GB | Better |
| GLM-4.6-BF16 | 357B | ~714GB | None | N/A | 800GB+ | Highest |
| DeepSeek-V3-AWQ | 671B | ~300GB | AWQ 4-bit | 45 tok/s | 250GB | Excellent |
| Qwen2.5-72B-AWQ | 72B | ~40GB | AWQ 4-bit | 120 tok/s | 48GB | Excellent |
## Known Limitations
1. **Requires 4× GPUs**: Minimum deployment configuration
2. **No MTP Support**: Speculative decoding layer removed
3. **Memory Bandwidth Bound**: Speed scales with GPU memory bandwidth
4. **TP=4 Only**: Tested configuration (other TP sizes may work)
5. **vLLM Dependency**: Optimized specifically for vLLM runtime
## Troubleshooting
### "KeyError: 'Linear'" Error
Run the fix script to add required config:
```bash
python fix_awq_config_for_vllm.py --model /path/to/GLM-4.6-AWQ
```
### Out of Memory Errors
1. Enable expert parallelism: `--enable-expert-parallel`
2. Reduce context length: `--max-model-len 65536`
3. Lower GPU utilization: `--gpu-memory-utilization 0.85`
4. Limit concurrent requests: `--max-num-seqs 1` (see the combined example below)
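Combined example applying all four adjustments (values are starting points; tune for your hardware):
```bash
vllm serve <model_path> \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 1 \
  --trust-remote-code \
  --port 8000
```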
### Slow Inference
1. Check `/nothink` is appended to prompts
2. Verify Marlin kernels are active (check logs)
3. Monitor GPU utilization (`nvidia-smi dmon`)
4. Ensure NVLink is working between GPUs (see the commands below)
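For steps 3 and 4, a couple of standard NVIDIA tools help (flags shown are the common ones; check `nvidia-smi --help` for your driver version):
```bash
# Stream per-GPU SM and memory utilization once per second
nvidia-smi dmon -s um

# Print the GPU interconnect topology; NVLink connections show up as NV# entries
nvidia-smi topo -m
```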
## Citation
If you use this quantized model, please cite:
```bibtex
@software{glm4_awq_2025,
title = {GLM-4.6-AWQ: Production-Optimized 4-bit Quantization},
author = {bullpoint},
year = {2025},
url = {https://huggingface.co/bullpoint/GLM-4.6-AWQ}
}
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv preprint arXiv:2306.00978},
year={2023}
}
@software{zai2025glm46,
title={GLM-4.6},
author={Z.ai and ZHIPU AI},
year={2025},
url={https://huggingface.co/zai-org/GLM-4.6},
license={MIT}
}
```
## License
**MIT License** - This quantized model inherits the MIT license from the [original GLM-4.6 model](https://huggingface.co/zai-org/GLM-4.6).
You are free to:
- ✅ Use commercially
- ✅ Modify and distribute
- ✅ Use privately
- ✅ Sublicense
See the base model repository for full license terms.
## Acknowledgments
- **Z.ai** for the original [GLM-4.6 model](https://huggingface.co/zai-org/GLM-4.6)
- **ZHIPU AI** for the GLM architecture and training
- **vLLM Team** for the excellent inference engine
- **MIT Han Lab** for the AWQ algorithm
- **Neural Magic** for:
- llm-compressor quantization toolkit
- [LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) calibration dataset
- **Community** for testing and feedback
## Reproduction
Want to quantize this model yourself? See the included [`quantize_glm46_awq.py`](quantize_glm46_awq.py) script for the exact quantization configuration used.
### Quantization Hardware Requirements
This model was quantized on modest hardware with extensive CPU offloading:
- **GPU**: 1× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB GDDR7)
- **RAM**: 768GB DDR5
- **Swap**: 300GB (actively used during quantization)
- **Quantization Time**: ~5 hours (includes calibration, smoothing, compression, and saving)
**Note**: The quantization process offloads the full BF16 model (~714GB) to system RAM/swap, since it exceeds available VRAM. Using 4 GPUs during quantization provides **no speed benefit**; the process is CPU memory-bound, not GPU-bound. The included script defaults to single-GPU mode (`CUDA_VISIBLE_DEVICES=0`) for optimal resource usage.
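Assuming the script is run from this repository with its default settings, the invocation is a single command (any additional arguments the script may accept are not assumed here):
```bash
# Single-GPU quantization run; pinning the GPU is redundant with the
# script's default but makes the intent explicit
CUDA_VISIBLE_DEVICES=0 python quantize_glm46_awq.py
```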
### Key Settings
- Calibration dataset: [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration)
- Samples: 512
- Sequence length: 2048 tokens
- Group size: 128
- Bits: 4 (symmetric int)
- Device map: Sequential (CPU offloading enabled)
## Support
For issues and questions:
- **Model Issues**: Open an issue on this model's repository
- **vLLM Issues**: [vLLM GitHub](https://github.com/vllm-project/vllm/issues)
- **Quantization**: [llm-compressor GitHub](https://github.com/vllm-project/llm-compressor/issues)
---
**Status**: ✅ Production Ready | **Last Updated**: October 2025 | **Tested With**: vLLM 0.11.0+