# Qwen3-VLTO-32B-Instruct-NVFP4-256K

This is a 256K-context, NVFP4-quantized version of qingy2024/Qwen3-VLTO-32B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.
## Model Description
- Base Model: qingy2024/Qwen3-VLTO-32B-Instruct
- Quantization Format: NVFP4 (4-bit floating point)
- Target Hardware: NVIDIA DGX Spark (Grace Blackwell Superchip)
- Quantization Tool: NVIDIA TensorRT Model Optimizer v0.35.1
- Model Size: Approximately 20 GB (68% reduction from BF16)
## Performance Characteristics

### Memory Efficiency
| Model Version | Memory Usage | Reduction |
|---|---|---|
| BF16 (Original) | 61.03 GB | Baseline |
| NVFP4-256K (This model) | 19.42 GB | 68.2% |
### Inference Speed
| Model Version | Throughput | Relative Performance |
|---|---|---|
| BF16 (Original) | 3.65 tokens/s | Baseline |
| NVFP4-256K (This model) | 9.99 tokens/s | 2.74x faster |
Test Configuration:
- Hardware: NVIDIA DGX Spark GB10
- Framework: vLLM 0.10.2
- Max Model Length: 262144 tokens
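The benchmark harness itself is not published with this card; under the same stack, a rough single-stream tokens/s measurement could look like the following sketch (the prompt and decode length are illustrative choices, not the card's actual test set):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=256)

# Time a single generation and divide generated tokens by wall-clock time.
start = time.perf_counter()
outputs = llm.generate(["Summarize the history of GPU computing."], params)
elapsed = time.perf_counter() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens / elapsed:.2f} tokens/s")
```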
## Quantization Details

### NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating-point quantization format, featuring:

- Two-level scaling: An E4M3 FP8 scale per 16-value block plus a global FP32 tensor scale (see the sketch after this list)
- Hardware acceleration: Optimized for Tensor Cores on Blackwell GB10 GPUs
- Group size: 16
- Minimal accuracy degradation: Less than 1% versus the original model
- Excluded modules: lm_head (kept in higher precision)
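To make the two-level scheme concrete, here is a minimal NumPy sketch of the scaling arithmetic. It is a simulated round-trip only: the real format packs two FP4 values per byte and rounds block scales to E4M3, both of which are omitted here.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1); sign is handled separately.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x, block_size=16):
    """Simulated NVFP4 round-trip: global FP32 scale + per-block scale.

    Real block scales are stored in E4M3 FP8; that rounding is skipped here.
    """
    x = x.reshape(-1, block_size)
    # Global scale maps the tensor max into the E4M3 block-scale range.
    global_scale = np.abs(x).max() / (6.0 * 448.0)  # 448 = E4M3 max
    # Per-block scale maps each block max onto the FP4 max (6.0).
    block_scale = np.abs(x).max(axis=1, keepdims=True) / 6.0 / global_scale
    scale = np.maximum(block_scale * global_scale, 1e-12)
    scaled = x / scale
    # Snap each scaled value to the nearest FP4 magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    return (np.sign(scaled) * FP4_LEVELS[idx] * scale).ravel()

w = np.random.randn(64).astype(np.float32)
print(np.abs(w - fake_quant_nvfp4(w)).max())  # small round-trip error
```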
### Calibration
- Dataset: C4 (Colossal Clean Crawled Corpus)
- Calibration samples: 1024
- Maximum sequence length: 262144 tokens
- Method: Post-training quantization with activation calibration
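For reference, a minimal sketch of assembling such a calibration set follows. The `allenai/c4` dataset path and `text` field are the usual Hub conventions, assumed here; the exact preprocessing used for this checkpoint is not published.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream 1024 raw-text samples from C4 for activation calibration.
tokenizer = AutoTokenizer.from_pretrained("qingy2024/Qwen3-VLTO-32B-Instruct")
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
texts = [row["text"] for row in c4.take(1024)]

# Tokenize, capping at the card's stated maximum sequence length.
calib_batches = [
    tokenizer(t, return_tensors="pt", truncation=True, max_length=262144).input_ids
    for t in texts
]
```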
## Usage

### Requirements

- NVIDIA DGX Spark or compatible Blackwell GPU
- vLLM >= 0.6.5 (the benchmarks above used 0.10.2)
- `nvidia-modelopt[hf]`
### Loading the Model

**Important:** This model must be loaded with vLLM using the `modelopt` quantization parameter. The standard HuggingFace `AutoModelForCausalLM` path will not work.
```python
from vllm import LLM, SamplingParams

# Load the NVFP4-quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
```
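The snippet above relies on vLLM's defaults for context handling. To open the full 262,144-token window explicitly, `max_model_len` can be passed; note that the KV cache grows with context length, so long windows need correspondingly more GPU memory:

```python
# Optional: request the full 256K context window explicitly.
llm_long = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",
    trust_remote_code=True,
    max_model_len=262144,  # KV cache scales with this; watch GPU memory headroom
    gpu_memory_utilization=0.9,
)
```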
### Environment Variables

You can optionally set:

- `HF_CACHE_DIR`: Override the HuggingFace cache location
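For example, set it before loading the model (the path is illustrative):

```python
import os

# Point the HuggingFace cache at a larger volume before loading the model.
os.environ["HF_CACHE_DIR"] = "/mnt/nvme/hf-cache"
```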
## Limitations

- Hardware-specific: Optimized for the NVIDIA Blackwell architecture (GB10)
- vLLM required: Cannot be loaded with the standard transformers library
- Quantization artifacts: Minor precision loss (<1%) compared to the BF16 original
## Intended Use
This model is intended for:
- High-throughput inference on NVIDIA DGX Spark systems
- Production deployments requiring memory-efficient models
- Research on quantization techniques for large language models
## Training and Quantization

### Base Model Training
See the original model card for base model training details.
### Quantization Process

1. Model loading: The original model is loaded in BF16 precision
2. Calibration: 1024 samples from the C4 dataset supply activation statistics
3. Quantization: The NVFP4 format is applied with NVIDIA TensorRT Model Optimizer
4. Export: The checkpoint is saved in HuggingFace safetensors format
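These steps follow the standard TensorRT Model Optimizer post-training quantization flow. A condensed sketch is below; the exact script for this checkpoint is not published, so the dataloader (`calib_batches`) and the `export_hf_checkpoint` call are assumptions based on modelopt's documented API:

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

# 1. Load the original model in BF16.
model = AutoModelForCausalLM.from_pretrained(
    "qingy2024/Qwen3-VLTO-32B-Instruct", torch_dtype="bfloat16", device_map="auto"
)

# 2. Calibration loop: run tokenized C4 samples through the model so modelopt
#    can collect activation statistics (calib_batches built elsewhere, e.g. as
#    in the Calibration section above).
def forward_loop(model):
    for batch in calib_batches:
        model(batch.to(model.device))

# 3. Apply the NVFP4 recipe (lm_head stays in higher precision by default).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# 4. Export a HuggingFace-style safetensors checkpoint.
export_hf_checkpoint(model, export_dir="Qwen3-VLTO-32B-Instruct-NVFP4-256K")
```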
## Evaluation

### Test Results
All 5 inference tests passed successfully:
- Technical explanation generation
- Code generation
- Mathematical reasoning
- Creative writing
- Instruction following
Average performance: 9.99 tokens/s on DGX Spark GB10
## Citation
If you use this quantized model, please cite:
```bibtex
@misc{qwen3vlto32b-nvfp4-256K,
  author       = {Ex0bit},
  title        = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K}},
}
```
And the original base model:
```bibtex
@misc{qingy2024qwen3vlto,
  author       = {qingy2024},
  title        = {Qwen3-VLTO-32B-Instruct},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}
```
## References
- NVIDIA TensorRT Model Optimizer
- vLLM Documentation
- NVIDIA DGX Spark Documentation
- Quantization GitHub Repository
## License
This quantized model inherits the license from the base model. Please refer to the original model's license for details.
## Model Card Authors
- Ex0bit (@Ex0bit)
## Acknowledgments
- NVIDIA for TensorRT Model Optimizer and DGX Spark hardware
- qingy2024 for the base Qwen3-VLTO-32B-Instruct model
- The vLLM team for the high-performance inference framework