---
language:
- en
license: apache-2.0
tags:
- text-generation
- quantization
- nvfp4
- nvidia
- dgx-spark
- blackwell
- model_hub_mixin
- pytorch_model_hub_mixin
base_model: qingy2024/Qwen3-VLTO-32B-Instruct
inference: false
---

# Qwen3-VLTO-32B-Instruct-NVFP4

This is an NVFP4-quantized version of [qingy2024/Qwen3-VLTO-32B-Instruct](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct), optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.

## Model Description

- **Base Model:** qingy2024/Qwen3-VLTO-32B-Instruct
- **Quantization Format:** NVFP4 (4-bit floating point)
- **Target Hardware:** NVIDIA DGX Spark (Grace Blackwell Superchip)
- **Quantization Tool:** NVIDIA TensorRT Model Optimizer v0.35.1
- **Model Size:** Approximately 20 GB (68% smaller than BF16)

## Performance Characteristics

### Memory Efficiency

| Model Version | Memory Usage | Reduction |
|---------------|--------------|-----------|
| BF16 (Original) | 61.03 GB | Baseline |
| NVFP4 (This model) | 19.42 GB | 68.2% |

### Inference Speed

| Model Version | Throughput | Relative Performance |
|---------------|---------------|----------------------|
| BF16 (Original) | 3.65 tokens/s | Baseline |
| NVFP4 (This model) | 9.99 tokens/s | 2.74x faster |

**Test Configuration:**

- Hardware: NVIDIA DGX Spark GB10
- Framework: vLLM 0.10.2
- Max Model Length: 8192 tokens
- GPU Memory Utilization: 90%

## Quantization Details

### NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating-point quantization format, featuring:

- **Two-level scaling:** an E4M3 FP8 scale per 16-value block plus a global FP32 per-tensor scale (see the toy sketch below)
- **Hardware acceleration:** optimized for Tensor Cores on Blackwell GB10 GPUs
- **Group size:** 16
- **Minimal accuracy degradation:** less than 1% versus the original model
- **Excluded modules:** lm_head (kept in higher precision)

### Calibration

- **Dataset:** C4 (Colossal Clean Crawled Corpus)
- **Calibration samples:** 512
- **Maximum sequence length:** 2048 tokens
- **Method:** Post-training quantization with activation calibration (a sketch of a comparable flow follows below)
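A comparable post-training quantization flow can be put together with TensorRT Model Optimizer. The sketch below is illustrative rather than the exact script used for this checkpoint: it assumes modelopt's documented `mtq.quantize` entry point with the `NVFP4_DEFAULT_CFG` preset and the `export_hf_checkpoint` helper, and the C4 loading and batching details are hypothetical stand-ins.

```python
# Illustrative NVFP4 PTQ flow with NVIDIA TensorRT Model Optimizer (modelopt).
# Assumes mtq.quantize + NVFP4_DEFAULT_CFG and export_hf_checkpoint; the
# dataset handling is a sketch, not the exact script behind this checkpoint.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

MODEL_ID = "qingy2024/Qwen3-VLTO-32B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# 512 C4 samples, truncated to 2048 tokens, as described above.
calib_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
calib_batches = []
for i, row in enumerate(calib_stream):
    if i >= 512:
        break
    ids = tokenizer(
        row["text"], return_tensors="pt", truncation=True, max_length=2048
    ).input_ids.to(model.device)
    calib_batches.append(ids)

def forward_loop(model):
    # modelopt drives this loop to collect activation statistics.
    with torch.no_grad():
        for input_ids in calib_batches:
            model(input_ids)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="Qwen3-VLTO-32B-Instruct-NVFP4")
```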
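For intuition about the two-level scaling described under NVFP4 Format, here is a purely conceptual fake-quantization sketch in PyTorch. Values are grouped into blocks of 16, each block gets a scale rounded through FP8 E4M3, a global FP32 scale normalizes the block scales, and each value snaps to the E2M1 (FP4) grid. The real kernels operate on packed 4-bit data; this toy only reproduces the arithmetic.

```python
# Toy illustration of NVFP4's two-level scaling (not the production kernel).
# Requires PyTorch >= 2.1 for the float8_e4m3fn dtype.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16  # NVFP4 group size

def fake_quant_nvfp4(x: torch.Tensor) -> torch.Tensor:
    # Assumes x.numel() is divisible by the block size.
    flat = x.reshape(-1, BLOCK)
    # Global FP32 scale: maps the largest per-block scale (amax / 6) to the
    # E4M3 maximum of 448 so all block scales fit the FP8 range.
    global_scale = (flat.abs().max() / (6.0 * 448.0)).clamp(min=1e-12)
    # Per-block scale in E4M3 units, rounded by casting through float8_e4m3fn.
    block_scale = (flat.abs().amax(dim=1, keepdim=True) / 6.0) / global_scale
    block_scale = block_scale.to(torch.float8_e4m3fn).to(torch.float32)
    scale = (block_scale * global_scale).clamp(min=1e-12)
    # Snap each scaled value to the nearest FP4 (E2M1) grid point.
    scaled = (flat / scale).clamp(-6.0, 6.0)
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = FP4_GRID[idx] * scaled.sign()
    return (q * scale).reshape(x.shape)

x = torch.randn(4, 32)
err = (x - fake_quant_nvfp4(x)).abs().mean()
print(f"mean abs quantization error: {err:.4f}")
```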
## Usage

### Requirements

- NVIDIA DGX Spark or compatible Blackwell GPU
- vLLM >= 0.6.5
- nvidia-modelopt[hf]

### Loading the Model

**IMPORTANT:** This model must be loaded with vLLM using the `modelopt` quantization parameter. The standard HuggingFace `AutoModelForCausalLM` will not work.

```python
from vllm import LLM, SamplingParams

# Load the NVFP4-quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4",
    quantization="modelopt",  # Required for NVFP4
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Environment Variables

You can optionally set:

- `HF_CACHE_DIR`: Override the HuggingFace cache location

## Limitations

- **Hardware-specific:** Optimized for the NVIDIA Blackwell architecture (GB10)
- **vLLM required:** Cannot be loaded with the standard transformers library
- **Quantization artifacts:** Minor precision loss (<1%) compared to the BF16 original

## Intended Use

This model is intended for:

- High-throughput inference on NVIDIA DGX Spark systems
- Production deployments requiring memory-efficient models
- Research on quantization techniques for large language models

## Training and Quantization

### Base Model Training

See the [original model card](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for base model training details.

### Quantization Process

1. **Model Loading:** Original model loaded in BF16 precision
2. **Calibration:** 512 samples from the C4 dataset used to gather activation statistics
3. **Quantization:** NVFP4 format applied using NVIDIA modelopt
4. **Export:** Saved in HuggingFace safetensors format

**Quantization Time:** Approximately 60-90 minutes on DGX Spark

## Evaluation

### Test Results

All 5 inference tests passed successfully:

- Technical explanation generation
- Code generation
- Mathematical reasoning
- Creative writing
- Instruction following

**Average performance:** 9.99 tokens/s on DGX Spark GB10

## Citation

If you use this quantized model, please cite:

```bibtex
@misc{qwen3vlto32b-nvfp4,
  author = {Ex0bit},
  title = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4}},
}
```

And the original base model:

```bibtex
@misc{qingy2024qwen3vlto,
  author = {qingy2024},
  title = {Qwen3-VLTO-32B-Instruct},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}
```

## References

- [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [vLLM Documentation](https://docs.vllm.ai/)
- [NVIDIA DGX Spark Documentation](https://docs.nvidia.com/dgx-spark/)
- [Quantization GitHub Repository](https://github.com/Ex0bit/nvfp4-quantization)

## License

This quantized model inherits the license of the base model. Please refer to the [original model's license](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for details.

## Model Card Authors

- Ex0bit (@Ex0bit)

## Acknowledgments

- NVIDIA for the TensorRT Model Optimizer and DGX Spark hardware
- qingy2024 for the base Qwen3-VLTO-32B-Instruct model
- The vLLM team for the high-performance inference framework