Qwen3-4B-NVFP4A16

Model Overview

  • Model Architecture: Qwen/Qwen3-4B
  • Input: Text
  • Output: Text
  • Model Optimizations:
    • Weight quantization: FP4 (NVFP4) with group size 16
    • Activation quantization: none (activations are kept in FP16)
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/19/2025
  • Version: 1.0
  • Model Developers: 2imi9

This model is a quantized version of Qwen/Qwen3-4B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "2imi9/Qwen3-4B-NVFP4A16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
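
For server deployments, vLLM's OpenAI-compatible API can serve this checkpoint directly. Below is a minimal sketch; the port and sampling parameters are illustrative, and the standard openai Python client is assumed.

# Start the server in a separate shell first, e.g.:
#   vllm serve 2imi9/Qwen3-4B-NVFP4A16 --port 8000
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is required by the client but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="2imi9/Qwen3-4B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)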

Creation

This model was created with LLM Compressor by applying data-free post-training quantization (PTQ), i.e. without calibration samples, as shown in the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-4B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the Linear weights to FP4 with group size 16 via PTQ (lm_head is excluded)
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Apply quantization.
oneshot(model=model, recipe=recipe)

print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
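
As a quick sanity check after saving, you can confirm that the compressed-tensors quantization metadata was written into the exported config. A minimal sketch, reusing the SAVE_DIR from the snippet above (the exact layout of quantization_config depends on the llmcompressor/compressed-tensors versions):

import json
import os

SAVE_DIR = "../../../model/Qwen3-4B-NVFP4A16"  # same directory as above

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

# save_compressed=True is expected to embed the NVFP4A16 scheme here.
print(json.dumps(config.get("quantization_config", {}), indent=2))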

Evaluation

This model was evaluated on the MMLU-Redux, Math500, IFEval, RULER-NIAH, and LiveCodeBench benchmarks. All evaluations were conducted with a custom evaluation framework using a vLLM backend.

Accuracy

| Category | Metric | Qwen/Qwen3-4B | Qwen3-4B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (T) | N/A | 82.1% | N/A |
| General Knowledge | MMLU-Redux | 83.7% | 78.4% | 93.7% |
| Mathematical Reasoning | Math500 (T) | 97.0% | 92.8% | 95.7% |
| Mathematical Reasoning | Math500 | 73.8% | 69.2% | 93.8% |
| Instruction Following | IFEval (strict prompt-level accuracy) | 81.9% | 78.5% | 95.8% |
| Long Context | RULER-NIAH-32k | N/A | 85.3% | N/A |
| Coding | LiveCodeBench (2410-2502) (T), Pass@1 | 54.2% | 48.7% | 89.9% |
| Coding | LiveCodeBench (2410-2502), Pass@1 | 16.71% | 13.8% | 82.6% |

(T) denotes evaluation with thinking mode enabled.

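Recovery is simply the quantized score expressed as a percentage of the unquantized baseline. A quick check against a few rows of the table above:

# Recovery (%) = quantized score / baseline score * 100
rows = {
    "MMLU-Redux": (83.7, 78.4),
    "Math500 (T)": (97.0, 92.8),
    "IFEval": (81.9, 78.5),
    "LiveCodeBench (2410-2502) Pass@1": (16.71, 13.8),
}

for name, (baseline, quantized) in rows.items():
    print(f"{name}: {quantized / baseline * 100:.1f}% recovery")
# Prints 93.7%, 95.7%, 95.8%, 82.6%, matching the table.
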
Reproduction

The results were obtained using the following commands, run on a single RTX 5090 GPU:

MMLU-Redux

# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_4B_bf16 \
    model_name=2imi9/Qwen3-4B-NVFP4A16 \
    eval_dataset=mmlu-redux \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/mmlu_redux_nvfp4_4B_vllm \
    debug=false

Math500

# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_4B_bf16 \
    model_name=2imi9/Qwen3-4B-NVFP4A16 \
    eval_dataset=math500 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/math500_nvfp4_4B_vllm_thinking \
    debug=false

IFEval

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_4B_bf16 \
    model_name=2imi9/Qwen3-4B-NVFP4A16 \
    eval_dataset=ifeval \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/ifeval_nvfp4_4B_vllm \
    debug=false

RULER-NIAH-32k

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_4B_bf16 \
    model_name=2imi9/Qwen3-4B-NVFP4A16 \
    eval_dataset=ruler-niah-32k \
    eval_dataset_config=ruler-32k \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.7 \
    predictor_conf.vllm.max_num_seqs=1 \
    predictor_conf.vllm.max_num_batched_tokens=16384 \
    predictor_conf.vllm.max_seq_len=32768 \
    predictor_conf.vllm.enable_prefix_caching=false \
    +predictor_conf.vllm.cpu_offload_gb=8 \
    +predictor_conf.vllm.device=auto \
    output_dir=/app/outputs/ruler_niah_nvfp4_4B_vllm \
    debug=false

LiveCodeBench

# LiveCodeBench (2410-2502) with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_4B_bf16 \
    model_name=2imi9/Qwen3-4B-NVFP4A16 \
    eval_dataset=livecodebench \
    eval_dataset_config=livecodebench-2410-2502 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/livecodebench_nvfp4_4B_vllm_thinking \
    debug=false

Technical Details

  • Quantization Scheme: NVFP4A16 (FP4 weights with group size 16; activations kept in 16-bit precision)
  • Excluded Layers: The language model head (lm_head) is not quantized
  • Memory Reduction: Approximately 75% reduction in weight storage (see the back-of-envelope estimate after this list)
  • Inference Backend: Optimized for vLLM with tensor parallelism support
  • Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
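
The ~75% memory figure follows from simple parameter-count arithmetic. A back-of-envelope sketch, assuming roughly 4B parameters, 4-bit weights with one 8-bit scale per group of 16 (the NVFP4 layout), and a 16-bit baseline; unquantized layers such as lm_head and the embeddings add some overhead on top:

params = 4e9  # approximate parameter count

bf16_bytes = params * 2                # 16-bit baseline weights
fp4_bytes = params * 0.5               # 4 bits per weight
scale_bytes = params / 16              # one 8-bit (E4M3) scale per group of 16
nvfp4_bytes = fp4_bytes + scale_bytes

print(f"BF16 weights:  {bf16_bytes / 1e9:.2f} GB")
print(f"NVFP4 weights: {nvfp4_bytes / 1e9:.2f} GB")
print(f"Reduction:     {(1 - nvfp4_bytes / bf16_bytes) * 100:.0f}%")
# About 8 GB -> ~2.25 GB (~72% smaller); the quoted ~75% corresponds to the
# idealized 16-bit -> 4-bit ratio before scale and unquantized-layer overhead.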

Configuration Notes

  • GPU memory utilization can be adjusted between 0.7 and 0.9 depending on available hardware
  • For long-context evaluation (32k), reduced memory utilization (0.7) and CPU offload (8 GB) are recommended for the 4B model (see the vLLM sketch after this list)
  • Prefix caching can be disabled in memory-constrained environments
  • A tensor parallel size of 1 is sufficient for the 4B-parameter model on an RTX 5090
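
These notes map directly onto vLLM's engine arguments. A minimal sketch of a memory-conservative long-context configuration follows; the argument names are standard vLLM options, but exact availability and defaults vary across vLLM versions:

from vllm import LLM

# Conservative settings for 32k-token contexts on a single RTX 5090.
llm = LLM(
    model="2imi9/Qwen3-4B-NVFP4A16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.7,    # leave headroom for long KV caches
    max_model_len=32768,           # context length tested with RULER-NIAH
    max_num_seqs=1,                # process one long sequence at a time
    max_num_batched_tokens=16384,
    enable_prefix_caching=False,   # disable prefix caching to save memory
    cpu_offload_gb=8,              # offload up to 8 GB of weights to CPU RAM
)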

Performance Characteristics

The NVFP4A16 quantization scheme provides:

  • High accuracy retention: roughly 90-96% recovery on most benchmarks (82.6% in the worst case)
  • Significant memory savings: ~75% reduction in model size (from ~8GB to ~2GB)
  • Improved inference speed: Especially beneficial for memory-bound scenarios
  • Hardware compatibility: Optimized for NVIDIA tensor cores with FP4 support

Use Cases

This quantized model is particularly well-suited for:

  • Edge deployment scenarios with limited GPU memory
  • High-throughput inference servers requiring memory efficiency
  • Development and testing environments
  • Multi-model serving where memory is at a premium

Limitations

  • Slight performance degradation compared to the full-precision model
  • Coding tasks show a more pronounced impact (82.6% recovery on LiveCodeBench without thinking)
  • Requires compatible hardware and software stack for optimal performance
  • May not be suitable for applications requiring maximum accuracy