
πŸ™ Github   |   πŸ“„ Paper

SINQ 4-bit Quantized Apertus-8B-2509 model

This repository contains the official 4-bit quantized version of the Apertus-8B-2509 model, produced with the SINQ (Sinkhorn-Normalized Quantization) method.
SINQ is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.

To support the project, please star ⭐ the official SINQ GitHub repository.

Model Details

  • Model Name: Apertus-8B-2509-4bit-SINQ
  • Base Model: swiss-ai/Apertus-8B-2509
  • Task: Text Generation
  • Framework: PyTorch / Transformers
  • License: Apache-2.0
  • Quantized By: Huawei - Computing Systems Lab

Quantization Details

  • Quantization Method: SINQ (Sinkhorn-Normalized Quantization)
  • Precision: INT4
  • Group Size: 64
  • Framework: PyTorch
  • Quantization Library: sinq
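
With these settings, a rough back-of-envelope estimate of the packed weight footprint looks as follows. This is only a sketch: the parameter count (~8B), the assumption of one 16-bit scale and one 16-bit zero point per group of 64 weights, and the exclusion of layers kept in higher precision (e.g. embeddings) are all approximations, so the actual checkpoint size will differ.

# Back-of-envelope footprint estimate for INT4 weights with group size 64.
# Assumptions (not taken from the checkpoint): ~8e9 quantized parameters,
# one fp16 scale and one fp16 zero point per group; higher-precision layers ignored.
n_params = 8e9
bits_per_weight = 4
group_size = 64

weight_bytes = n_params * bits_per_weight / 8          # packed 4-bit weights
meta_bytes = (n_params / group_size) * (2 + 2)         # per-group scale + zero point
print(f"~{(weight_bytes + meta_bytes) / 1e9:.1f} GB")  # roughly 4.5 GB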

πŸš€ Usage

Prerequisite

  • Before running the quantization script, make sure the SINQ library is installed. Installation instructions and setup details are available in the official SINQ GitHub repository.

  • For optimal inference speed, ensure that the GemLite library is installed (a quick import check is sketched after this list).
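
If you want a quick sanity check that both libraries are importable before loading the model, a minimal sketch is shown below; it assumes the packages are importable as sinq and gemlite.

# Minimal import check (assumes the import names "sinq" and "gemlite")
import importlib.util

for pkg in ("sinq", "gemlite"):
    if importlib.util.find_spec(pkg) is None:
        print(f"{pkg} not found - see the SINQ GitHub repository for installation instructions")
    else:
        print(f"{pkg} is installed")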

Usage example

You can load and use the model with our wrapper based on the πŸ€— Transformers library:

import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Apertus-8B-2509-4bit-SINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

# OPTIONAL: compile the forward pass to further increase the inference speed
sinq_model.forward(torch.tensor([[0]]).to("cuda:0"))  # dummy warm-up call on the model's device
sinq_model.forward = torch.compile(sinq_model.forward, dynamic=True, fullgraph=False, backend='inductor', mode='reduce-overhead')

template = """{% for m in messages -%}
{{ m['role'] }}: {{ m['content'] }}
{% endfor -%}
{% if add_generation_prompt %}assistant: {% endif %}"""

tokenizer.chat_template = template  # set once per tokenizer

# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)

# Generate the output
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

You can optionally compile the model's forward pass with torch.compile, which can provide a significant speed boost. Note that the first run will take longer because PyTorch compiles optimized kernels, but subsequent runs will be much faster.
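
To see the effect on your own hardware, a simple timing helper like the sketch below (timed_generate is a hypothetical helper, not part of the SINQ API) can be run with the sinq_model and model_inputs objects created above; the exact numbers depend on your GPU.

import time
import torch

def timed_generate(model, inputs, n_tokens=50):
    """Time one generate() call, synchronizing the GPU before and after."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"first run  (triggers compilation): {timed_generate(sinq_model, model_inputs):.2f} s")
print(f"second run (compiled kernels reused): {timed_generate(sinq_model, model_inputs):.2f} s")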

🧩 Quantization Process

The quantized model was obtained using the SINQ quantization library, following the steps below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "swiss-ai/Apertus-8B-2509"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,            # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",   # tiling strategy
    method="sinq"       # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)

Reproducibility Note: This model was quantized using the SINQ implementation from commit bbbc657 of the SINQ repository.
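
As an optional smoke test after quantization, you can reuse the tokenizer loaded above and generate a few tokens with the quantized model; the prompt and token budget below are arbitrary choices.

# Optional sanity check on the freshly quantized model (arbitrary prompt and length)
inputs = tokenizer("Gravity is", return_tensors="pt").to("cuda:0")
out = qmodel.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))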



🧾 How to Cite This Work

If you find SINQ useful in your research or applications, please

  • Star ⭐ the official SINQ GitHub repository.
  • Cite our paper:
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}