Qwen3-4B-NVFP4A16
Model Overview
- Model Architecture: Qwen/Qwen3-4B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4 with group size 16 (a conceptual sketch of group-wise FP4 quantization follows this overview)
  - Activation quantization: FP16
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/19/2025
- Version: 1.0
- Model Developers: 2imi9
This model is a quantized version of Qwen/Qwen3-4B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
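As a conceptual illustration of the weight scheme, the sketch below fake-quantizes groups of 16 weights to the FP4 (E2M1) value set with a shared absmax scale per group. This is a simplified, hypothetical illustration, not the actual NVFP4 kernel path; in particular, real NVFP4 encodes the per-group scales in FP8, which is ignored here.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the sign is handled separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4_groups(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Simulate group-wise FP4 quantization: each group of `group_size` weights
    shares one scale chosen so the group's absmax maps to 6.0 (the largest FP4
    magnitude), then every weight is rounded to the nearest FP4 level."""
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    normalized = np.abs(w) / scales
    # Round each magnitude to the nearest representable FP4 level.
    idx = np.abs(normalized[..., None] - FP4_LEVELS).argmin(axis=-1)
    dequantized = np.sign(w) * FP4_LEVELS[idx] * scales
    return dequantized.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 32)).astype(np.float32)
print("max abs error:", np.abs(w - fake_quantize_fp4_groups(w)).max())
```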
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "2imi9/Qwen3-4B-NVFP4A16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
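For example, after starting the server with `vllm serve 2imi9/Qwen3-4B-NVFP4A16`, the OpenAI-compatible endpoint can be queried with the `openai` client. This is a minimal sketch that assumes the server's default port 8000 and no API key configured:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (any placeholder key works
# when the server was started without --api-key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="2imi9/Qwen3-4B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```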
Creation
This model was created by applying LLM Compressor with post-training quantization (PTQ) without calibration samples, as presented in the code snippet below.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
# Load model.
MODEL_ID = "Qwen/Qwen3-4B"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the Linear weights to FP4 with group size 16 via PTQ (NVFP4A16)
#   * keep activations in 16-bit precision and leave lm_head unquantized
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
# Apply quantization.
oneshot(model=model, recipe=recipe)
print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")
# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Evaluation
This model was evaluated on the well-known MMLU-Redux, Math500, IFEval, RULER-NIAH, and LiveCodeBench benchmarks. All evaluations were conducted using a custom evaluation framework with the vLLM backend.
Accuracy
| Category | Metric | Qwen/Qwen3-4B | Qwen3-4B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (T) | NA | 82.1% | NA |
| General Knowledge | MMLU-Redux | 83.7% | 78.4% | 93.7% |
| Mathematical Reasoning | Math500 (T) | 97.0% | 92.8% | 95.7% |
| Mathematical Reasoning | Math500 | 73.8% | 69.2% | 93.8% |
| Instruction Following | IFEval (Strict Prompt-Level Acc) | 81.9% | 78.5% | 95.8% |
| Long Context | RULER-NIAH-32k | NA | 85.3% | NA |
| Coding | LiveCodeBench (2410-2502)(T) Pass@1 | 54.2% | 48.7% | 89.9% |
| Coding | LiveCodeBench (2410-2502) Pass@1 | 16.71% | 13.8% | 82.6% |

(T) marks runs with thinking mode enabled.
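Recovery is the quantized model's score expressed as a percentage of the BF16 baseline's score, for example:

```python
# Recovery = quantized score / baseline score (MMLU-Redux, non-thinking row).
baseline, quantized = 83.7, 78.4
print(f"recovery: {quantized / baseline:.1%}")  # -> recovery: 93.7%
```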
Reproduction
The results were obtained using the following commands (run on a single RTX 5090 GPU):
MMLU-Redux
# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=mmlu-redux \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/mmlu_redux_nvfp4_4B_vllm \
debug=false
Math500
# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=math500 \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/math500_nvfp4_4B_vllm_thinking \
debug=false
IFEval
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=ifeval \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/ifeval_nvfp4_4B_vllm \
debug=false
RULER-NIAH-32k
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=ruler-niah-32k \
eval_dataset_config=ruler-32k \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.7 \
predictor_conf.vllm.max_num_seqs=1 \
predictor_conf.vllm.max_num_batched_tokens=16384 \
predictor_conf.vllm.max_seq_len=32768 \
predictor_conf.vllm.enable_prefix_caching=false \
+predictor_conf.vllm.cpu_offload_gb=8 \
+predictor_conf.vllm.device=auto \
output_dir=/app/outputs/ruler_niah_nvfp4_4B_vllm \
debug=false
LiveCodeBench
# LiveCodeBench (2410-2502) with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
--config-name=qwen3_4B_bf16 \
model_name=2imi9/Qwen3-4B-NVFP4A16 \
eval_dataset=livecodebench \
eval_dataset_config=livecodebench-2410-2502 \
eval_predictor=vllm \
predictor_conf.vllm.tensor_parallel_size=1 \
predictor_conf.vllm.gpu_memory_utilization=0.9 \
output_dir=/app/outputs/livecodebench_nvfp4_4B_vllm_thinking \
debug=false
Technical Details
- Quantization Scheme: NVFP4A16 (FP4 weights with group size 16; FP16 activations)
- Excluded Layers: The language model head (`lm_head`) is not quantized
- Memory Reduction: Approximately 75% reduction in model size (a rough estimate is sketched after this list)
- Inference Backend: Optimized for vLLM with tensor parallelism support
- Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
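The ~75% memory-reduction figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below counts only the quantized weights at 4 bits plus one assumed 8-bit scale per group of 16, and ignores embeddings, the unquantized lm_head, and other overhead:

```python
params = 4.0e9              # approximate number of quantized weights (4B model)
bf16_gb = params * 2 / 1e9  # 2 bytes per weight in BF16

# FP4 stores 4 bits per weight, plus one 8-bit scale per group of 16 weights.
fp4_gb = (params * 0.5 + params / 16) / 1e9

print(f"BF16 weights : ~{bf16_gb:.1f} GB")            # ~8.0 GB
print(f"NVFP4 weights: ~{fp4_gb:.2f} GB")             # ~2.25 GB
print(f"reduction    : ~{1 - fp4_gb / bf16_gb:.0%}")  # ~72% (about 75% if scales are ignored)
```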
Configuration Notes
- GPU memory utilization can be adjusted between 0.7 and 0.9 depending on available hardware
- For long-context evaluation (32k), reduced memory utilization (0.7) and CPU offload (8 GB) are recommended for the 4B model; see the vLLM sketch after this list
- Prefix caching can be disabled in memory-constrained environments
- A tensor parallel size of 1 is sufficient for the 4B-parameter model on an RTX 5090
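For reference, a sketch of the equivalent offline vLLM engine settings for the 32k RULER run, mirroring the flags in the reproduction command above (values are what was tested; adjust to your hardware):

```python
from vllm import LLM

# Long-context (32k) settings: lower GPU memory utilization, CPU offload for
# headroom, prefix caching disabled, and one sequence at a time.
llm = LLM(
    model="2imi9/Qwen3-4B-NVFP4A16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,
    max_model_len=32768,
    max_num_seqs=1,
    max_num_batched_tokens=16384,
    enable_prefix_caching=False,
    cpu_offload_gb=8,
)
```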
Performance Characteristics
The NVFP4A16 quantization scheme provides:
- High accuracy retention: 89-96% recovery across most benchmarks
- Significant memory savings: ~75% reduction in model size (from ~8GB to ~2GB)
- Improved inference speed: Especially beneficial for memory-bound scenarios
- Hardware compatibility: Optimized for NVIDIA tensor cores with FP4 support
Use Cases
This quantized model is particularly well-suited for:
- Edge deployment scenarios with limited GPU memory
- High-throughput inference servers requiring memory efficiency
- Development and testing environments
- Multi-model serving where memory is at a premium
Limitations
- Slight performance degradation compared to the full-precision model
- Coding tasks show a larger impact (82.6% recovery on non-thinking LiveCodeBench)
- Requires compatible hardware and software stack for optimal performance
- May not be suitable for applications requiring maximum accuracy