Qwen3-4B-NVFP4A16
Model Overview
- Model Architecture: Qwen/Qwen3-4B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4 with group size 16 (a conceptual sketch of group-wise FP4 quantization follows this overview)
  - Activation quantization: FP16
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/19/2025
- Version: 1.0
- Model Developers: 2imi9
This model is a quantized version of Qwen/Qwen3-4B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
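As a conceptual illustration of the weight scheme, the sketch below fake-quantizes groups of 16 weights to the FP4 (E2M1) value set with a shared absmax scale per group. This is a simplified, hypothetical illustration, not the actual NVFP4 kernel path; in particular, real NVFP4 encodes the per-group scales in FP8, which is ignored here.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the sign is handled separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4_groups(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Simulate group-wise FP4 quantization: each group of `group_size` weights
    shares one scale chosen so the group's absmax maps to 6.0 (the largest FP4
    magnitude), then every weight is rounded to the nearest FP4 level."""
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    normalized = np.abs(w) / scales
    # Round each magnitude to the nearest representable FP4 level.
    idx = np.abs(normalized[..., None] - FP4_LEVELS).argmin(axis=-1)
    dequantized = np.sign(w) * FP4_LEVELS[idx] * scales
    return dequantized.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 32)).astype(np.float32)
print("max abs error:", np.abs(w - fake_quantize_fp4_groups(w)).max())
```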
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "2imi9/Qwen3-4B-NVFP4A16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
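For example, after starting the server with `vllm serve 2imi9/Qwen3-4B-NVFP4A16`, the OpenAI-compatible endpoint can be queried with the `openai` client. This is a minimal sketch that assumes the server's default port 8000 and no API key configured:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (any placeholder key works
# when the server was started without --api-key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="2imi9/Qwen3-4B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```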
Creation
This model was created by applying LLM Compressor with post-training quantization (PTQ) without calibration samples, as presented in the code snippet below.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
# Load model.
MODEL_ID = "Qwen/Qwen3-4B"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the Linear weights to FP4 with group size 16 via PTQ (NVFP4A16)
#   * keep activations in 16-bit precision and leave lm_head unquantized
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
# Apply quantization.
oneshot(model=model, recipe=recipe)
print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")
# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Evaluation
This model was evaluated on the well-known MMLU-Redux, Math500, IFEval, RULER-NIAH, and LiveCodeBench benchmarks. All evaluations were conducted using a custom evaluation framework with the vLLM backend.
Accuracy
| Category | Metric | Qwen/Qwen3-4B | Qwen3-4B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (T) | NA | 82.1% | NA |
| General Knowledge | MMLU-Redux | 83.7% | 78.4% | 93.7% |
| Mathematical Reasoning | Math500 (T) | 97.0% | 92.8% | 95.7% |
| Mathematical Reasoning | Math500 | 73.8% | 69.2% | 93.8% |
| Instruction Following | IFEval (Strict Prompt-Level Acc) | 81.9% | 78.5% | 95.8% |
| Long Context | RULER-NIAH-32k | NA | 85.3% | NA |
| Coding | LiveCodeBench (2410-2502)(T) Pass@1 | 54.2% | 48.7% | 89.9% |
| Coding | LiveCodeBench (2410-2502) Pass@1 | 16.71% | 13.8% | 82.6% |

(T) marks runs with thinking mode enabled.
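Recovery is the quantized model's score expressed as a percentage of the BF16 baseline's score, for example:

```python
# Recovery = quantized score / baseline score (MMLU-Redux, non-thinking row).
baseline, quantized = 83.7, 78.4
print(f"recovery: {quantized / baseline:.1%}")  # -> recovery: 93.7%
```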
Reproduction
The results were obtained using the following commands (run on a single RTX 5090 GPU):
MMLU-Redux
# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=mmlu-redux \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/mmlu_redux_nvfp4_4B_vllm \
debug=false
Math500
# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=math500 \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/math500_nvfp4_4B_vllm_thinking \
debug=false
IFEval
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=ifeval \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/ifeval_nvfp4_4B_vllm \
debug=false
RULER-NIAH-32k
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=ruler-niah-32k \
eval_dataset_config=ruler-32k \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.7 \
predictor_conf.vllm.max_num_seqs=1 \
predictor_conf.vllm.max_num_batched_tokens=16384 \
predictor_conf.vllm.max_seq_len=32768 \
predictor_conf.vllm.enable_prefix_caching=false \
+predictor_conf.vllm.cpu_offload_gb=8 \
+predictor_conf.vllm.device=auto \
output_dir=/app/outputs/ruler_niah_nvfp4_4B_vllm \
debug=false
LiveCodeBench
# LiveCodeBench (2410-2502) with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=livecodebench \
eval_dataset_config=livecodebench-2410-2502 \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/livecodebench_nvfp4_4B_vllm_thinking \
debug=false
Technical Details
- Quantization Scheme: NVFP4A16 (FP4 weights with group size 16; FP16 activations)
- Excluded Layers: The language model head (`lm_head`) is not quantized
- Memory Reduction: Approximately 75% reduction in model size (a rough estimate is sketched after this list)
- Inference Backend: Optimized for vLLM with tensor parallelism support
- Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
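The ~75% memory-reduction figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below counts only the quantized weights at 4 bits plus one assumed 8-bit scale per group of 16, and ignores embeddings, the unquantized lm_head, and other overhead:

```python
params = 4.0e9              # approximate number of quantized weights (4B model)
bf16_gb = params * 2 / 1e9  # 2 bytes per weight in BF16

# FP4 stores 4 bits per weight, plus one 8-bit scale per group of 16 weights.
fp4_gb = (params * 0.5 + params / 16) / 1e9

print(f"BF16 weights : ~{bf16_gb:.1f} GB")            # ~8.0 GB
print(f"NVFP4 weights: ~{fp4_gb:.2f} GB")             # ~2.25 GB
print(f"reduction    : ~{1 - fp4_gb / bf16_gb:.0%}")  # ~72% (about 75% if scales are ignored)
```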
Configuration Notes
- GPU memory utilization can be adjusted between 0.7 and 0.9 depending on available hardware
- For long-context evaluation (32k), reduced memory utilization (0.7) and CPU offload (8 GB) are recommended for the 4B model; see the vLLM sketch after this list
- Prefix caching can be disabled in memory-constrained environments
- A tensor parallel size of 1 is sufficient for the 4B-parameter model on an RTX 5090
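For reference, a sketch of the equivalent offline vLLM engine settings for the 32k RULER run, mirroring the flags in the reproduction command above (values are what was tested; adjust to your hardware):

```python
from vllm import LLM

# Long-context (32k) settings: lower GPU memory utilization, CPU offload for
# headroom, prefix caching disabled, and one sequence at a time.
llm = LLM(
    model="2imi9/Qwen3-4B-NVFP4A16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,
    max_model_len=32768,
    max_num_seqs=1,
    max_num_batched_tokens=16384,
    enable_prefix_caching=False,
    cpu_offload_gb=8,
)
```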
Performance Characteristics
The NVFP4A16 quantization scheme provides:
- High accuracy retention: 89-96% recovery across most benchmarks
- Significant memory savings: ~75% reduction in model size (from ~8GB to ~2GB)
- Improved inference speed: Especially beneficial for memory-bound scenarios
- Hardware compatibility: Optimized for NVIDIA tensor cores with FP4 support
Use Cases
This quantized model is particularly well-suited for:
- Edge deployment scenarios with limited GPU memory
- High-throughput inference servers requiring memory efficiency
- Development and testing environments
- Multi-model serving where memory is at a premium
Limitations
- Slight performance degradation compared to the full-precision model
- Coding tasks show a larger impact (82.6% recovery on non-thinking LiveCodeBench)
- Requires compatible hardware and software stack for optimal performance
- May not be suitable for applications requiring maximum accuracy