Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of Qwen/Qwen2.5-Coder-32B-Instruct quantized to FP8 (weights and dynamic activations) using llm-compressor.

This model format is particularly useful for accelerated inference with vLLM on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).
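
If you are unsure whether your GPU has native FP8 support, a quick check of the reported compute capability is enough. This is a minimal sketch using torch; adjust the device index if you have multiple GPUs:

import torch

# Sketch: report the compute capability of GPU 0 and whether it meets the
# >= 8.9 threshold for native FP8 support (Ada Lovelace or newer).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
    print("Native FP8 support:", (major, minor) >= (8, 9))
else:
    print("No CUDA GPU detected.")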

Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the lm_head layer kept in its original precision to maintain output quality.
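
To confirm how the checkpoint was quantized, you can inspect the quantization metadata stored in its config.json. A minimal sketch follows; the exact keys depend on the compressed-tensors version that wrote the file:

from transformers import AutoConfig

# Sketch: print the quantization_config embedded in the model's config.json.
config = AutoConfig.from_pretrained("textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic")
print(config.quantization_config)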

Quantization with llm-compressor

The model was quantized using the oneshot method from llm-compressor with the FP8_DYNAMIC scheme. Because activation scales are computed dynamically at runtime, no calibration dataset is required.

The following script was used for conversion:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the new Model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Define the FP8 quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")


# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")

Inference Example

This model can be loaded and run with transformers, or for optimized FP8 inference, with vLLM.

Using transformers (for a functional check; not FP8-optimized)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, 
    device_map="auto", 
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024, 
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)

Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs. Prerequisites:

  • A recent version of vLLM that supports compressed-tensors.
  • A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
  • Docker and NVIDIA Container Toolkit installed.
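
If you prefer to call vLLM directly from Python rather than through the server described below, a minimal offline-inference sketch could look like this (assuming a local vLLM install recent enough to support compressed-tensors and the chat helper):

from vllm import LLM, SamplingParams

# Sketch: offline FP8 inference with vLLM's Python API.
llm = LLM(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    trust_remote_code=True,
    max_model_len=4096,  # adjust to your available VRAM
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)

# llm.chat applies the model's chat template automatically.
outputs = llm.chat(
    [{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)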

Running with Docker (Recommended): The following command starts a vLLM OpenAI-compatible server with this quantized model:

# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --tokenizer-mode auto \
    --load-format auto \
    --trust-remote-code \
    --max-model-len 4096 # Optional: Adjust based on your VRAM

Once running, the server exposes an OpenAI-compatible API at http://localhost:8000/v1/. You can use any OpenAI client library (e.g., openai for Python) or curl to send requests.
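
For example, with the official openai Python client, a request might look like the sketch below (the api_key value is arbitrary unless the server was started with --api-key):

from openai import OpenAI

# Sketch: query the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function for a binary search."},
    ],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)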

Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)

For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
