Qwen3-VLTO-32B-Instruct-NVFP4-256K

This is a 256K-context, NVFP4-quantized version of qingy2024/Qwen3-VLTO-32B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.

Model Description

  • Base Model: qingy2024/Qwen3-VLTO-32B-Instruct
  • Quantization Format: NVFP4 (4-bit floating point)
  • Target Hardware: NVIDIA DGX Spark (Grace Blackwell Superchip)
  • Quantization Tool: NVIDIA TensorRT Model Optimizer v0.35.1
  • Model Size: Approximately 20 GB (68% reduction from BF16)

Performance Characteristics

Memory Efficiency

| Model Version           | Memory Usage | Reduction |
|-------------------------|--------------|-----------|
| BF16 (Original)         | 61.03 GB     | Baseline  |
| NVFP4-256K (This model) | 19.42 GB     | 68.2%     |

Inference Speed

| Model Version           | Throughput    | Relative Performance |
|-------------------------|---------------|----------------------|
| BF16 (Original)         | 3.65 tokens/s | Baseline             |
| NVFP4-256K (This model) | 9.99 tokens/s | 2.74x faster         |

Test Configuration:

  • Hardware: NVIDIA DGX Spark GB10
  • Framework: vLLM 0.10.2
  • Max Model Length: 262144 tokens
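
For reference, here is a minimal sketch of how such a throughput number can be measured with vLLM's offline API. The prompt, max_tokens, and timing approach are illustrative assumptions, not the exact benchmark harness used for the figures above:

import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",
    trust_remote_code=True,
    max_model_len=262144,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Explain quantum computing in simple terms:"]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Decode throughput = generated tokens / wall-clock time
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} tokens/s")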

Quantization Details

NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating point quantization format featuring:

  • Two-level scaling: E4M3 FP8 scaling per 16-value block + global FP32 tensor scale (see the sketch after this list)
  • Hardware acceleration: Optimized for Tensor Cores on Blackwell GB10 GPUs
  • Group size: 16
  • Minimal accuracy degradation: Less than 1% versus the original model
  • Excluded modules: lm_head (kept in higher precision)
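
To make the two-level scaling concrete, here is a toy quantize-dequantize round trip over one 16-value block. The E2M1 value grid and nearest-value rounding are standard FP4 behavior; the kernel-level details on Blackwell hardware differ, so treat this as an illustrative sketch only:

import torch

# Positive values representable in FP4 E2M1; the full grid is symmetric.
FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_POS.flip(0), FP4_POS])

def nvfp4_fake_quant_block(x: torch.Tensor, global_scale: float) -> torch.Tensor:
    """Quantize-dequantize one 16-value block with two-level scaling."""
    assert x.numel() == 16
    # Level 1: per-block scale mapping the block max onto FP4's max (6.0),
    # itself stored in FP8 E4M3. Level 2: shared FP32 tensor scale.
    amax = x.abs().max().clamp(min=1e-12)
    block_scale = (amax / (6.0 * global_scale)).to(torch.float8_e4m3fn).float()
    block_scale = block_scale.clamp(min=2.0**-9)  # avoid a zero scale after FP8 rounding
    scaled = x / (block_scale * global_scale)
    # Round each element to the nearest representable FP4 value.
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * block_scale * global_scale

block = torch.randn(16)
print((block - nvfp4_fake_quant_block(block, global_scale=1.0)).abs().max())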

Calibration

  • Dataset: C4 (Colossal Clean Crawled Corpus)
  • Calibration samples: 1024
  • Maximum sequence length: 262144 tokens
  • Method: Post-training quantization with activation calibration
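
A minimal sketch of this calibration flow with nvidia-modelopt follows. The config name NVFP4_DEFAULT_CFG and the preprocessing details are assumptions based on modelopt's documented PTQ API; the exact recipe used for this checkpoint may differ:

import modelopt.torch.quantization as mtq
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qingy2024/Qwen3-VLTO-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stream C4 so the 1024 calibration samples need not be downloaded in full.
calib_set = load_dataset("allenai/c4", "en", split="train", streaming=True)

def forward_loop(m):
    # Feed calibration text through the model so modelopt can record
    # activation statistics. The card lists a 262144-token maximum; a
    # shorter truncation length is used in this sketch to keep memory modest.
    for _, sample in zip(range(1024), calib_set):
        ids = tokenizer(
            sample["text"], return_tensors="pt", truncation=True, max_length=2048
        ).input_ids.to(m.device)
        m(ids)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)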

Usage

Requirements

  • NVIDIA DGX Spark or compatible Blackwell GPU
  • vLLM >= 0.6.5 (tested with 0.10.2)
  • nvidia-modelopt[hf]

Loading the Model

IMPORTANT: This model must be loaded through vLLM with the modelopt quantization parameter; the standard HuggingFace AutoModelForCausalLM loading path will not work.

from vllm import LLM, SamplingParams

# Load NVFP4 quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",  
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
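
The model can also be served through vLLM's OpenAI-compatible server (for example, vllm serve Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K --quantization modelopt). A minimal client sketch, assuming the default endpoint at localhost:8000:

from openai import OpenAI

# vLLM's server speaks the OpenAI API; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].text)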

Environment Variables

You can optionally set:

  • HF_CACHE_DIR: Override HuggingFace cache location

Limitations

  • Hardware specific: Optimized for NVIDIA Blackwell architecture (GB10)
  • vLLM required: Cannot be loaded with standard transformers library
  • Quantization artifacts: Minor precision loss (<1%) compared to the BF16 original

Intended Use

This model is intended for:

  • High-throughput inference on NVIDIA DGX Spark systems
  • Production deployments requiring memory-efficient models
  • Research on quantization techniques for large language models

Training and Quantization

Base Model Training

See the original model card for base model training details.

Quantization Process

  1. Model Loading: Original model loaded in BF16 precision
  2. Calibration: 1024 samples from C4 dataset for activation statistics
  3. Quantization: NVFP4 format applied using NVIDIA modelopt
  4. Export: Saved in HuggingFace safetensors format
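
Steps 3-4 correspond roughly to the following modelopt calls. export_hf_checkpoint is modelopt's unified HuggingFace export; treat this as an assumed reconstruction rather than the exact script used:

from modelopt.torch.export import export_hf_checkpoint

# After mtq.quantize(...) as sketched in the Calibration section, write the
# quantized weights and scaling factors out as HuggingFace safetensors.
export_hf_checkpoint(model, export_dir="Qwen3-VLTO-32B-Instruct-NVFP4-256K")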

Evaluation

Test Results

All five inference tests passed:

  • Technical explanation generation
  • Code generation
  • Mathematical reasoning
  • Creative writing
  • Instruction following

Average throughput: 9.99 tokens/s on the DGX Spark GB10

Citation

If you use this quantized model, please cite:

@misc{qwen3vlto32b-nvfp4-256K,
  author = {Ex0bit},
  title = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K}},
}

And the original base model:

@misc{qingy2024qwen3vlto,
  author = {qingy2024},
  title = {Qwen3-VLTO-32B-Instruct},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}

License

This quantized model inherits the license from the base model. Please refer to the original model's license for details.

Model Card Authors

  • Ex0bit (@Ex0bit)

Acknowledgments

  • NVIDIA for TensorRT Model Optimizer and DGX Spark hardware
  • qingy2024 for the base Qwen3-VLTO-32B-Instruct model
  • The vLLM team for high-performance inference framework