---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- vLLM
- AWQ
base_model:
- deepseek-ai/DeepSeek-V3.2-Exp
base_model_relation: quantized
---
# DeepSeek-V3.2-Exp-AWQ-Lite
Base model: [DeepSeek-V3.2-Exp](https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.2-Exp)
### 【Dependencies / Installation】
As of **2025-09-30**, create a fresh Python environment and run:
```bash
pip install -U pip
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl
```
For more details, please refer to the vLLM documentation [[link]](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html).
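If you want to confirm the environment before serving, a minimal sanity check is shown below (this assumes the DeepGEMM wheel above installs its package under the import name `deep_gemm`):
```python
# Minimal sanity check: verify that the nightly vLLM build and the DeepGEMM wheel are importable.
# Assumes the wheel above exposes its package under the import name `deep_gemm`.
import vllm
import deep_gemm

print("vLLM version:", vllm.__version__)
print("DeepGEMM loaded from:", deep_gemm.__file__)
```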
### 【vLLM Startup Command - Single Node with 6 GPUs】
Notes, as of **2025-09-30**:
1. Only Hopper and Blackwell data center GPUs are supported for now.
2. The kernels are mainly optimized for TP=1, so it is recommended to run this model in DP/EP mode.
3. DP mode may increase serving latency. To mitigate this, we recommend enabling MTP (via the `--speculative-config` flag below) to maintain optimal speed; to disable MTP, simply remove the `--speculative-config` flag.
4. Some users have observed improved performance on H20 machines by setting `export VLLM_USE_DEEP_GEMM=0`.
```bash
CONTEXT_LENGTH=32768
vllm serve \
tclf90/DeepSeek-V3.2-Exp-AWQ-Lite \
--served-model-name MY_MODEL \
--data-parallel-size 6 \
--enable-expert-parallel \
--speculative-config '{"model": "tclf90/DeepSeek-V3.2-Exp-AWQ-Lite", "method": "deepseek_mtp", "num_speculative_tokens": 1}' \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len $CONTEXT_LENGTH \
--gpu-memory-utilization 0.99 \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
```
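Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal client sketch follows, assuming the `openai` Python package is installed and the server was started with `--served-model-name MY_MODEL` as above:
```python
from openai import OpenAI

# The vLLM server above exposes an OpenAI-compatible endpoint; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MY_MODEL",  # must match --served-model-name
    messages=[{"role": "user", "content": "Summarize DeepSeek Sparse Attention in one sentence."}],
    max_tokens=256,
    temperature=0.6,
)
print(response.choices[0].message.content)
```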
### 【Logs】
```
2025-10-01
1. Initial commit
```
### 【Model Files】
| File Size | Last Updated |
|-----------|--------------|
| `339GB` | `2025-10-01` |
### 【Model Download】
```python
from huggingface_hub import snapshot_download

# Downloads all model files (~339 GB); cache_dir sets the Hugging Face cache location
snapshot_download('tclf90/DeepSeek-V3.2-Exp-AWQ-Lite', cache_dir="your_local_path")
```
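If you prefer a plain directory layout whose path can be passed directly as a model argument (e.g., to `vllm serve`), `local_dir` is an alternative; a sketch, assuming a writable target directory of your choice:
```python
from huggingface_hub import snapshot_download

# local_dir materializes the files in a plain directory (instead of the HF cache layout)
# and returns that path, which can then be used as the model path when serving.
model_path = snapshot_download(
    'tclf90/DeepSeek-V3.2-Exp-AWQ-Lite',
    local_dir="/data/DeepSeek-V3.2-Exp-AWQ-Lite",  # hypothetical target directory
)
print(model_path)
```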
### 【Overview】
# DeepSeek-V3.2-Exp
## Introduction
We are excited to announce the official release of DeepSeek-V3.2-Exp, an experimental version of our model. As an intermediate step toward our next-generation architecture, V3.2-Exp builds upon V3.1-Terminus by introducing DeepSeek Sparse Attention—a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.
This experimental release represents our ongoing research into more efficient transformer architectures, particularly focusing on improving computational efficiency when processing extended text sequences.
- DeepSeek Sparse Attention (DSA) achieves fine-grained sparse attention for the first time, delivering substantial improvements in long-context training and inference efficiency while maintaining virtually identical model output quality.
- To rigorously evaluate the impact of introducing sparse attention, we deliberately aligned the training configurations of DeepSeek-V3.2-Exp with V3.1-Terminus. Across public benchmarks in various domains, DeepSeek-V3.2-Exp demonstrates performance on par with V3.1-Terminus.
| Benchmark | DeepSeek-V3.1-Terminus | DeepSeek-V3.2-Exp |
| :--- | :---: | :---: |
| **Reasoning Mode w/o Tool Use** | | |
| MMLU-Pro | 85.0 | 85.0 |
| GPQA-Diamond | 80.7 | 79.9 |
| Humanity's Last Exam | 21.7 | 19.8 |
| LiveCodeBench | 74.9 | 74.1 |
| AIME 2025 | 88.4 | 89.3 |
| HMMT 2025 | 86.1 | 83.6 |
| Codeforces | 2046 | 2121 |
| Aider-Polyglot | 76.1 | 74.5 |
| **Agentic Tool Use** | | |
| BrowseComp | 38.5 | 40.1 |
| BrowseComp-zh | 45.0 | 47.9 |
| SimpleQA | 96.8 | 97.1 |
| SWE Verified | 68.4 | 67.8 |
| SWE-bench Multilingual | 57.8 | 57.9 |
| Terminal-bench | 36.7 | 37.7 |
## How to Run Locally
### HuggingFace
We provide an updated inference demo code in the [inference](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/tree/main/inference) folder to help the community quickly get started with our model and understand its architectural details.
First, convert the Hugging Face model weights to the format required by our inference demo. Set `MP` to match your available GPU count:
```bash
cd inference
export EXPERTS=256
# HF_CKPT_PATH: path to the downloaded Hugging Face weights; SAVE_PATH: output directory;
# MP: model-parallel degree, i.e. the number of GPUs to shard the model across.
python convert.py --hf-ckpt-path ${HF_CKPT_PATH} --save-path ${SAVE_PATH} --n-experts ${EXPERTS} --model-parallel ${MP}
```
Launch the interactive chat interface and start exploring DeepSeek's capabilities:
```bash
export CONFIG=config_671B_v3.2.json
torchrun --nproc-per-node ${MP} generate.py --ckpt-path ${SAVE_PATH} --config ${CONFIG} --interactive
```
### SGLang
#### Installation with Docker
```bash
# H200
docker pull lmsysorg/sglang:dsv32
# MI350
docker pull lmsysorg/sglang:dsv32-rocm
# NPUs
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3
```
#### Launch Command
```bash
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --page-size 64
```
### vLLM
vLLM provides day-0 support of DeepSeek-V3.2-Exp. See the [recipes](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html) for up-to-date details.
## Open-Source Kernels
For TileLang kernels with **better readability and research-purpose design**, please refer to [TileLang](https://github.com/tile-ai/tilelang/tree/main/examples/deepseek_v32).
For **high-performance CUDA kernels**, indexer logit kernels (including paged versions) are available in [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM/pull/200). Sparse attention kernels are released in [FlashMLA](https://github.com/deepseek-ai/FlashMLA/pull/98).
## License
This repository and the model weights are licensed under the [MIT License](LICENSE).
## Citation
```bibtex
@misc{deepseekai2024deepseekv32,
title={DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention},
author={DeepSeek-AI},
year={2025},
}
```
## Contact
If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).