---
language:
- en
license: apache-2.0
tags:
- quantization
- sinq
- int4
- efficient-inference
- text-generation
- llm
- compression
base_model:
- swiss-ai/Apertus-8B-2509
base_model_relation: quantized
---


πŸ™ Github   |   πŸ“„ Paper

# SINQ 4-bit Quantized Apertus-8B-2509 Model

This repository contains the official **4-bit quantized** version of the [`Apertus-8B-2509`](https://huggingface.co/swiss-ai/Apertus-8B-2509) model, obtained with the **SINQ (Sinkhorn-Normalized Quantization)** method. SINQ is a novel, fast and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.

To support the project, please put a star ⭐ in the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.

## Model Details
- **Model Name:** `Apertus-8B-2509-4bit-SINQ`
- **Base Model:** [`swiss-ai/Apertus-8B-2509`](https://huggingface.co/swiss-ai/Apertus-8B-2509)
- **Task:** Text Generation
- **Framework:** PyTorch / Transformers
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Quantized By:** *Huawei - Computing Systems Lab*

## Quantization Details
- **Quantization Method:** SINQ (Sinkhorn-Normalized Quantization)
- **Precision:** INT4
- **Group Size:** 64
- **Framework:** PyTorch
- **Quantization Library:** `sinq`

---

# 🚀 Usage

## Prerequisites
- Before running the quantization script, make sure the **SINQ** library is installed. Installation instructions and setup details are available in the [SINQ official GitHub repository](https://github.com/huawei-csl/SINQ).
- For optimal inference speed, ensure that the [GemLite library](https://github.com/dropbox/gemlite) is installed.

## Usage example

You can load and use the model with our wrapper based on the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Apertus-8B-2509-4bit-SINQ"
device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device=device,
    compute_dtype=torch.bfloat16
)

# OPTIONAL: compile the forward pass if you want to further increase the inference speed
sinq_model.forward(torch.tensor([[0]]).to(device))
sinq_model.forward = torch.compile(
    sinq_model.forward,
    dynamic=True,
    fullgraph=False,
    backend="inductor",
    mode="reduce-overhead"
)

# Simple chat template (set once per tokenizer)
template = """{% for m in messages -%}
{{ m['role'] }}: {{ m['content'] }}
{% endfor -%}
{% if add_generation_prompt %}assistant: {% endif %}"""
tokenizer.chat_template = template

# Prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)

# Generate the output
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)

# Decode only the newly generated tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

> You can optionally compile the model's forward pass using `torch.compile`, which can provide a significant speed boost. Note that the first run takes longer because PyTorch compiles optimized kernels, but subsequent runs are much faster.
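To see the effect of `torch.compile` in practice, here is a minimal, hypothetical timing sketch that reuses the `sinq_model` and `model_inputs` objects from the example above; the `time_generation` helper and the run/token counts are illustrative and not part of the SINQ API.

```python
import time
import torch

def time_generation(model, inputs, max_new_tokens=100, runs=3):
    """Rough wall-clock timing of generate(); illustrative helper, not part of SINQ."""
    timings = []
    for _ in range(runs):
        torch.cuda.synchronize()          # make sure pending GPU work has finished
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()          # wait for generation to complete on the GPU
        timings.append(time.perf_counter() - start)
    return timings

# The first run includes torch.compile's kernel compilation; later runs should be faster.
for i, t in enumerate(time_generation(sinq_model, model_inputs)):
    print(f"run {i}: {t:.2f} s")
```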
---

# 🧩 Quantization Process

The quantized model was obtained using the **SINQ** quantization library, following the steps below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "swiss-ai/Apertus-8B-2509"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq"      # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)
```

> **Reproducibility Note**: This model was quantized using the SINQ implementation from commit [`bbbc657`](https://github.com/huawei-csl/SINQ/commit/bbbc6571e625ae6837813734abc94e776923d82f) of the [SINQ](https://github.com/huawei-csl/SINQ) repository.
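After quantization, a quick smoke test can confirm that the quantized model still generates sensible text. The sketch below assumes that `quantize_model` returns (or patches in place) a regular 🤗 Transformers causal LM, so the standard `generate` API is still available as in the usage example above; the prompt and token budget are arbitrary.

```python
# Quick smoke test of the freshly quantized model (assumes `qmodel`, `tokenizer`
# and the "cuda:0" device from the snippet above).
import torch

prompt = "Briefly explain what weight quantization does."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    out = qmodel.generate(**inputs, max_new_tokens=50)

# Print only the newly generated continuation
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```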

---

# 🧾 How to Cite This Work

If you find **SINQ** useful in your research or applications, please:

- Put a star ⭐ in the official [SINQ](https://github.com/huawei-csl/SINQ) GitHub repository.
- Cite our paper:

```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```