Qwen2.5-Coder-32B-Instruct-FP8-dynamic
This is a version of Qwen/Qwen2.5-Coder-32B-Instruct quantized to FP8 (weights and dynamic activations) using llm-compressor.
This model format is particularly useful for accelerated inference with vLLM on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).
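As a quick sanity check that your GPU meets this requirement, you can query its compute capability with PyTorch (an illustrative snippet, assuming a CUDA-enabled PyTorch install):
import torch

# FP8 kernels require compute capability (8, 9) or higher (Ada Lovelace and newer).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP8 supported:", (major, minor) >= (8, 9))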
Model Description
Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the lm_head layer kept in its original precision to maintain output quality.
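The quantization settings are recorded in the checkpoint's config.json and can be inspected directly. The snippet below is a small illustrative sketch; the quantization_config field is written by llm-compressor in the compressed-tensors format:
import json
from huggingface_hub import hf_hub_download

# Download only the config file and print the recorded quantization settings.
config_path = hf_hub_download(
    repo_id="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", {}), indent=2))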
Quantization with llm-compressor
The model was quantized using the oneshot method from llm-compressor with the FP8_DYNAMIC scheme. No calibration dataset was required for this quantization scheme.
The following script was used for conversion:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import os
# --- 1. Set the new Model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"
# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)
# --- 3. The quantization recipe remains the same ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")
# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt},
]
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(f"Generated Response:\n{response}")
print("==========================================")
# --- 5. Save the quantized model and the tokenizer correctly ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)
print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)
print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
Inference Example
This model can be loaded and run with transformers or, for optimized FP8 inference, with vLLM.
Using transformers (for functional checking, not FP8-optimized)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"
# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True,
)
prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt},
]
# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
Using vLLM (for optimized FP8 inference)
This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs; a minimal offline-inference sketch follows the prerequisites below.
Prerequisites:
- A recent version of vLLM that supports compressed-tensors.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
- Docker and NVIDIA Container Toolkit installed.
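If vLLM is installed locally (e.g. via pip install vllm), the model can also be used for offline inference through vLLM's Python API without starting a server (Docker is only needed for the server setup described next). The snippet below is a minimal sketch under that assumption; argument names follow recent vLLM releases and may need adjusting for your version and VRAM:
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch. Adjust max_model_len (and optionally
# gpu_memory_utilization) to fit your available VRAM.
llm = LLM(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    trust_remote_code=True,
    max_model_len=4096,
)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=512)

# llm.chat applies the model's chat template automatically (available in recent vLLM versions).
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)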
Running with Docker (Recommended): The following command starts a vLLM OpenAI-compatible server with this quantized model:
# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --tokenizer-mode auto \
    --load-format auto \
    --trust-remote-code \
    --max-model-len 4096  # Optional: adjust based on your VRAM
Once running, the server exposes an OpenAI-compatible API at http://localhost:8000/v1/. You can use any OpenAI client library (e.g., openai for Python) or curl to send requests.
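For example, here is a minimal request with the openai Python client (pip install openai); the api_key value is a placeholder unless the server was started with --api-key:
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function for binary search. Include comments."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)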
Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)
For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct