---
library_name: vllm
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B
tags:
- neuralmagic
- redhat
- llmcompressor
- int4
- w4a16
- quantized
---

## Model Overview

- **Model Architecture:** SmolLM3-3B
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** None
- **Release Date:** 07/31/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the INT4 data type.
This optimization reduces the number of bits used to represent each weight from 16 to 4, cutting GPU memory and disk size requirements by approximately 75%.
Only the weights of the linear operators within transformer blocks are quantized; activations retain their original 16-bit precision (W4A16).
Quantization was performed with the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
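
As a rough sanity check on the ~75% figure, the arithmetic below estimates weight memory before and after quantization. The parameter count and per-group overhead are illustrative assumptions, not measurements from this card.

```python
# Back-of-the-envelope estimate of weight memory for a ~3B-parameter model.
params = 3.1e9  # approximate parameter count (assumption)

bf16_bytes = params * 2    # 16 bits = 2 bytes per weight
int4_bytes = params * 0.5  # 4 bits = 0.5 bytes per weight

# Group-wise INT4 schemes also store one 16-bit scale (plus a zero-point
# for asymmetric quantization) per group of 128 weights.
overhead_bytes = params / 128 * 2 * 2

print(f"BF16 weights: {bf16_bytes / 1e9:.1f} GB")                    # ~6.2 GB
print(f"INT4 weights: {(int4_bytes + overhead_bytes) / 1e9:.1f} GB") # ~1.6 GB
print(f"Reduction:    {1 - (int4_bytes + overhead_bytes) / bf16_bytes:.0%}")
```

In practice the saving is somewhat smaller, since embeddings, the `lm_head`, and modules outside the transformer blocks remain unquantized.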

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to plain text, appending the assistant-turn
# header so generation continues as the assistant.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
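
For instance, the model can be served with `vllm serve RedHatAI/SmolLM3-3B-quantized.w4a16` and then queried with any OpenAI client. The sketch below assumes the server is running locally on the default port 8000 and that the `openai` Python package is installed.

```python
# Query a locally running vLLM OpenAI-compatible server.
# Start it first in a shell:  vllm serve RedHatAI/SmolLM3-3B-quantized.w4a16
from openai import OpenAI

# The API key is arbitrary unless the server is launched with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/SmolLM3-3B-quantized.w4a16",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```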

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the command below:

```bash
python int4.py --model_path HuggingFaceTB/SmolLM3-3B --calib_size 1024 --dampening_frac 0.1 --observer minmax --actorder group --sym false
```

where `int4.py` is as follows:

```python
import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Constants
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
MAX_SEQ_LENGTH = 8192
IGNORE_MODULES = ["lm_head"]

# Argument parsing utilities
def parse_actorder(value: str):
    value_lower = value.lower()
    if value_lower == "false":
        return False
    if value_lower in {"weight", "group"}:
        return value_lower
    raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")

def parse_sym(value: str):
    value_lower = value.lower()
    if value_lower in {"true", "false"}:
        return value_lower == "true"
    raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")

# Argument parser
def get_args():
    parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
    parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
    parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
    parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
    parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
    parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
    parser.add_argument('--actorder', type=parse_actorder, default=False,
                        help="Activation order: 'group', 'weight', or 'false'.")
    return parser.parse_args()

def main():
    args = get_args()

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        use_cache=False,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)

    # Load and preprocess dataset
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(args.calib_size))
    ds = ds.map(lambda x: {"text": x["text"]})
    ds = ds.map(
        lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
        remove_columns=ds.column_names,
    )

    # Build quantization scheme
    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            symmetric=args.sym,
            group_size=128,
            strategy=QuantizationStrategy.GROUP,
            observer=args.observer,
            actorder=args.actorder,
        ),
        input_activations=None,
        output_activations=None,
    )

    # Define compression recipe
    recipe = [
        GPTQModifier(
            targets=["Linear"],
            ignore=IGNORE_MODULES,
            dampening_frac=args.dampening_frac,
            config_groups={"group_0": quant_scheme},
        )
    ]

    # Apply quantization
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        num_calibration_samples=args.calib_size,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    # Save the quantized model
    save_path = f"{args.model_path}-quantized.w4a16"
    model.save_pretrained(save_path, save_compressed=True)
    tokenizer.save_pretrained(save_path)

if __name__ == "__main__":
    main()
```

</details>

## Evaluation

This model was evaluated on the well-known reasoning benchmarks AIME24, MATH-500, and GPQA-Diamond.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and scores were collected with the [LightEval](https://github.com/huggingface/lighteval) library.

<details>
<summary>Evaluation details</summary>

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

export TASK=aime24 # {aime24, math_500, gpqa:diamond}

lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
  --use-chat-template \
  --output-dir out_dir
```

</details>

### Accuracy

Scores are pass@1 averaged over the number of generations shown in parentheses (e.g., pass@1:64 averages 64 generations per prompt). Recovery is the quantized model's score as a percentage of the unquantized baseline's score.

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>HuggingFaceTB/SmolLM3-3B</th>
    <th>RedHatAI/SmolLM3-3B-quantized.w4a16<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="4"><strong>Reasoning</strong></td>
    <td>AIME24 (pass@1:64)</td>
    <td>45.31</td>
    <td>39.27</td>
    <td>86.67%</td>
  </tr>
  <tr>
    <td>MATH-500 (pass@1:4)</td>
    <td>89.30</td>
    <td>87.55</td>
    <td>98.04%</td>
  </tr>
  <tr>
    <td>GPQA-Diamond (pass@1:8)</td>
    <td>41.22</td>
    <td>41.86</td>
    <td>101.55%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>58.61</strong></td>
    <td><strong>56.23</strong></td>
    <td><strong>95.94%</strong></td>
  </tr>
</table>
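
As a quick check, the Recovery column and the averages can be recomputed directly from the table values:

```python
# Recompute Recovery (= quantized / baseline) from the table's scores.
baseline  = {"AIME24": 45.31, "MATH-500": 89.30, "GPQA-Diamond": 41.22}
quantized = {"AIME24": 39.27, "MATH-500": 87.55, "GPQA-Diamond": 41.86}

for task in baseline:
    print(f"{task}: {quantized[task] / baseline[task]:.2%}")

# Averages are taken over the raw scores and rounded before the final ratio.
avg_base = round(sum(baseline.values()) / 3, 2)    # 58.61
avg_quant = round(sum(quantized.values()) / 3, 2)  # 56.23
print(f"Average recovery: {avg_quant / avg_base:.2%}")  # 95.94%
```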