Qwen3-VLTO-32B-Instruct-NVFP4-256K

This is a 256K-context, NVFP4-quantized version of qingy2024/Qwen3-VLTO-32B-Instruct, optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.

Model Description

  • Base Model: qingy2024/Qwen3-VLTO-32B-Instruct
  • Quantization Format: NVFP4 (4-bit floating point)
  • Target Hardware: NVIDIA DGX Spark (Grace Blackwell Superchip)
  • Quantization Tool: NVIDIA TensorRT Model Optimizer v0.35.1
  • Model Size: Approximately 20 GB (68% reduction from BF16)

Performance Characteristics

Memory Efficiency

| Model Version           | Memory Usage | Reduction |
|-------------------------|--------------|-----------|
| BF16 (Original)         | 61.03 GB     | Baseline  |
| NVFP4-256K (This model) | 19.42 GB     | 68.2%     |

Inference Speed

| Model Version           | Throughput    | Relative Performance |
|-------------------------|---------------|----------------------|
| BF16 (Original)         | 3.65 tokens/s | Baseline             |
| NVFP4-256K (This model) | 9.99 tokens/s | 2.74x faster         |

Test Configuration:

  • Hardware: NVIDIA DGX Spark GB10
  • Framework: vLLM 0.10.2
  • Max Model Length: 262144 tokens
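
For reference, here is a minimal sketch of how such a throughput number can be measured with vLLM's offline API. The prompt, max_tokens, and timing approach are illustrative assumptions, not the exact benchmark harness used for the figures above:

import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",
    trust_remote_code=True,
    max_model_len=262144,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Explain quantum computing in simple terms:"]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Decode throughput = generated tokens / wall-clock time
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} tokens/s")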

Quantization Details

NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating point quantization format featuring:

  • Two-level scaling: E4M3 FP8 scaling per 16-value block + global FP32 tensor scale (see the sketch after this list)
  • Hardware acceleration: Optimized for Tensor Cores on Blackwell GB10 GPUs
  • Group size: 16
  • Minimal accuracy degradation: Less than 1% versus the original model
  • Excluded modules: lm_head (kept in higher precision)
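
To make the two-level scaling concrete, here is a toy quantize-dequantize round trip over one 16-value block. The E2M1 value grid and nearest-value rounding are standard FP4 behavior; the kernel-level details on Blackwell hardware differ, so treat this as an illustrative sketch only:

import torch

# Positive values representable in FP4 E2M1; the full grid is symmetric.
FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_POS.flip(0), FP4_POS])

def nvfp4_fake_quant_block(x: torch.Tensor, global_scale: float) -> torch.Tensor:
    """Quantize-dequantize one 16-value block with two-level scaling."""
    assert x.numel() == 16
    # Level 1: per-block scale mapping the block max onto FP4's max (6.0),
    # itself stored in FP8 E4M3. Level 2: shared FP32 tensor scale.
    amax = x.abs().max().clamp(min=1e-12)
    block_scale = (amax / (6.0 * global_scale)).to(torch.float8_e4m3fn).float()
    block_scale = block_scale.clamp(min=2.0**-9)  # avoid a zero scale after FP8 rounding
    scaled = x / (block_scale * global_scale)
    # Round each element to the nearest representable FP4 value.
    idx = (scaled.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * block_scale * global_scale

block = torch.randn(16)
print((block - nvfp4_fake_quant_block(block, global_scale=1.0)).abs().max())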

Calibration

  • Dataset: C4 (Colossal Clean Crawled Corpus)
  • Calibration samples: 1024
  • Maximum sequence length: 262144 tokens
  • Method: Post-training quantization with activation calibration
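
A minimal sketch of this calibration flow with nvidia-modelopt follows. The config name NVFP4_DEFAULT_CFG and the preprocessing details are assumptions based on modelopt's documented PTQ API; the exact recipe used for this checkpoint may differ:

import modelopt.torch.quantization as mtq
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qingy2024/Qwen3-VLTO-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stream C4 so the 1024 calibration samples need not be downloaded in full.
calib_set = load_dataset("allenai/c4", "en", split="train", streaming=True)

def forward_loop(m):
    # Feed calibration text through the model so modelopt can record
    # activation statistics. The card lists a 262144-token maximum; a
    # shorter truncation length is used in this sketch to keep memory modest.
    for _, sample in zip(range(1024), calib_set):
        ids = tokenizer(
            sample["text"], return_tensors="pt", truncation=True, max_length=2048
        ).input_ids.to(m.device)
        m(ids)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)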

Usage

Requirements

  • NVIDIA DGX Spark or compatible Blackwell GPU
  • vLLM >= 0.6.5 (tested with 0.10.2)
  • nvidia-modelopt[hf]

Loading the Model

IMPORTANT: This model must be loaded through vLLM with the modelopt quantization parameter; the standard HuggingFace AutoModelForCausalLM loading path will not work.

from vllm import LLM, SamplingParams

# Load NVFP4 quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    quantization="modelopt",  
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
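
The model can also be served through vLLM's OpenAI-compatible server (for example, vllm serve Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K --quantization modelopt). A minimal client sketch, assuming the default endpoint at localhost:8000:

from openai import OpenAI

# vLLM's server speaks the OpenAI API; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K",
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].text)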

Environment Variables

You can optionally set:

  • HF_CACHE_DIR: Override HuggingFace cache location

Limitations

  • Hardware specific: Optimized for NVIDIA Blackwell architecture (GB10)
  • vLLM required: Cannot be loaded with standard transformers library
  • Quantization artifacts: Minor precision loss (<1%) compared to the BF16 original

Intended Use

This model is intended for:

  • High-throughput inference on NVIDIA DGX Spark systems
  • Production deployments requiring memory-efficient models
  • Research on quantization techniques for large language models

Training and Quantization

Base Model Training

See the original model card for base model training details.

Quantization Process

  1. Model Loading: Original model loaded in BF16 precision
  2. Calibration: 1024 samples from C4 dataset for activation statistics
  3. Quantization: NVFP4 format applied using NVIDIA modelopt
  4. Export: Saved in HuggingFace safetensors format
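
Steps 3-4 correspond roughly to the following modelopt calls. export_hf_checkpoint is modelopt's unified HuggingFace export; treat this as an assumed reconstruction rather than the exact script used:

from modelopt.torch.export import export_hf_checkpoint

# After mtq.quantize(...) as sketched in the Calibration section, write the
# quantized weights and scaling factors out as HuggingFace safetensors.
export_hf_checkpoint(model, export_dir="Qwen3-VLTO-32B-Instruct-NVFP4-256K")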

Evaluation

Test Results

All five inference tests passed:

  • Technical explanation generation
  • Code generation
  • Mathematical reasoning
  • Creative writing
  • Instruction following

Average throughput: 9.99 tokens/s on the DGX Spark GB10

Citation

If you use this quantized model, please cite:

@misc{qwen3vlto32b-nvfp4-256K,
  author = {Ex0bit},
  title = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4-256K}},
}

And the original base model:

@misc{qingy2024qwen3vlto,
  author = {qingy2024},
  title = {Qwen3-VLTO-32B-Instruct},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}

License

This quantized model inherits the license from the base model. Please refer to the original model's license for details.

Model Card Authors

  • Ex0bit (@Ex0bit)

Acknowledgments

  • NVIDIA for TensorRT Model Optimizer and DGX Spark hardware
  • qingy2024 for the base Qwen3-VLTO-32B-Instruct model
  • The vLLM team for high-performance inference framework