---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: apache-2.0
base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506
---

# Mistral-Small-3.2-24B-Instruct-2506-NVFP4

## Model Overview
- **Model Architecture:** unsloth/Mistral-Small-3.2-24B-Instruct-2506
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 10/29/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [unsloth/Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506). It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [unsloth/Mistral-Small-3.2-24B-Instruct-2506](https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506) to the FP4 data type, ready for inference with vLLM>=0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized using [LLM Compressor](https://github.com/vllm-project/llm-compressor).

## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 --tensor_parallel_size 1 --tokenizer_mode mistral
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/llama3_example.py), as presented in the code snippet below.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.utils import dispatch_for_generation

MODEL_ID = "unsloth/Mistral-Small-3.2-24B-Instruct-2506"

# Load model.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp4 with per-group-16 scales via PTQ
# * calibrate a global_scale for activations, which will be used to
#   quantize activations to fp4 on the fly
smoothing_strength = 0.9
recipe = [
    SmoothQuantModifier(smoothing_strength=smoothing_strength),
    QuantizationModifier(
        ignore=["re:.*lm_head.*"],
        config_groups={
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 4,
                    "type": "float",
                    "strategy": "tensor_group",
                    "group_size": 16,
                    "symmetric": True,
                    "observer": "mse",
                },
                "input_activations": {
                    "num_bits": 4,
                    "type": "float",
                    "strategy": "tensor_group",
                    "group_size": 16,
                    "symmetric": True,
                    "dynamic": "local",
                    "observer": "minmax",
                },
            }
        },
    ),
]

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"

# Apply quantization.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
)

print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
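After the script finishes, a quick sanity check is to inspect the quantization metadata that LLM Compressor writes into the exported checkpoint's `config.json`. The sketch below assumes the output directory produced above and that the checkpoint carries a `quantization_config` entry in the compressed-tensors format; the exact contents may vary across library versions.

```python
import json
import os

# Output directory produced by the quantization script above.
SAVE_DIR = "Mistral-Small-3.2-24B-Instruct-2506-NVFP4"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)

# compressed-tensors checkpoints are expected to carry a "quantization_config"
# entry describing the FP4 weight/activation scheme; print it for inspection.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```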
## Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness).

### Accuracy
| Category | Metric | unsloth/Mistral-Small-3.2-24B-Instruct-2506 | RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | arc_challenge | 68.52 | 66.98 | 97.75 |
| OpenLLM V1 | gsm8k | 89.61 | 87.11 | 97.21 |
| OpenLLM V1 | hellaswag | 85.70 | 85.11 | 99.31 |
| OpenLLM V1 | mmlu | 81.06 | 79.43 | 97.99 |
| OpenLLM V1 | truthfulqa_mc2 | 61.35 | 60.34 | 98.35 |
| OpenLLM V1 | winogrande | 83.27 | 81.61 | 98.01 |
| OpenLLM V1 | **Average** | **78.25** | **76.76** | **98.10** |
| OpenLLM V2 | BBH (3-shot) | 65.86 | 64.05 | 97.25 |
| OpenLLM V2 | MMLU-Pro (5-shot) | 50.84 | 48.45 | 95.30 |
| OpenLLM V2 | MuSR (0-shot) | 39.15 | 40.21 | 102.71 |
| OpenLLM V2 | IFEval (0-shot) | 84.05 | 84.41 | 100.43 |
| OpenLLM V2 | GPQA (0-shot) | 33.14 | 32.55 | 98.22 |
| OpenLLM V2 | Math-lvl-5 (4-shot) | 41.69 | 37.76 | 90.57 |
| OpenLLM V2 | **Average** | **52.46** | **51.24** | **97.68** |
| Coding | HumanEval_64 pass@2 | 88.88 | 88.84 | 99.95 |
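The Recovery column reports the quantized model's score as a percentage of the unquantized baseline's score; a minimal sketch of that calculation (function name and example values are illustrative, taken from the arc_challenge row above):

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

# arc_challenge row above: 66.98 / 68.52 ≈ 97.75%.
print(f"{recovery(68.52, 66.98):.2f}")  # -> 97.75
```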
### Reproduction

The results were obtained using the following commands:

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### OpenLLM v2
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --batch_size auto
```

#### HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```
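The per-category "Average" rows in the Accuracy table are plain arithmetic means of the individual task scores. They can be recomputed from the per-task results, as in the sketch below (scores are copied from the OpenLLM V1 rows above, not re-run):

```python
# Per-task scores from the Accuracy table: (baseline, quantized).
openllm_v1 = {
    "arc_challenge": (68.52, 66.98),
    "gsm8k": (89.61, 87.11),
    "hellaswag": (85.70, 85.11),
    "mmlu": (81.06, 79.43),
    "truthfulqa_mc2": (61.35, 60.34),
    "winogrande": (83.27, 81.61),
}

# Arithmetic means over the six tasks, plus overall recovery.
baseline_avg = sum(b for b, _ in openllm_v1.values()) / len(openllm_v1)
quantized_avg = sum(q for _, q in openllm_v1.values()) / len(openllm_v1)

print(f"baseline average:  {baseline_avg:.2f}")   # ~78.25
print(f"quantized average: {quantized_avg:.2f}")  # ~76.76
print(f"recovery:          {100 * quantized_avg / baseline_avg:.2f}%")  # ~98.10
```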