---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor). This format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for the weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to preserve output quality.

## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. No calibration dataset is required for this scheme. The following script was used for the conversion:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Configure the FP8 quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt},
]
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(f"Generated Response:\n{response}")
print("==========================================")

# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)
print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)
print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```

## Inference Examples

This model can be loaded and run with `transformers` or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (functional check, not FP8-optimized)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True,
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt},
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the input and move it to the model's device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```

### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

**Prerequisites:**

- A recent version of vLLM with `compressed-tensors` support.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
- Docker and the NVIDIA Container Toolkit installed.

**Running with Docker (recommended):**

The following command starts a vLLM OpenAI-compatible server with this quantized model:

```bash
# 1. Set your Hugging Face token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
  --tokenizer-mode auto \
  --load-format auto \
  --trust-remote-code \
  --max-model-len 4096  # Optional: adjust based on your VRAM
```

Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1/`. You can use any OpenAI client library (e.g., `openai` for Python) or `curl` to send requests; a minimal Python client sketch is included at the end of this card.

## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)

For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
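
## Example Client Request (Python)

For reference, here is a minimal sketch of querying the vLLM server started above with the official `openai` Python client (v1.x). The `api_key` value is a placeholder assumption: by default vLLM does not check it unless the server is started with an API key. If you serve the model under a different name (for example via vLLM's `--served-model-name` option), adjust the `model` field accordingly.

```python
# Minimal sketch, assuming the vLLM server from the Docker command above is
# running locally on port 8000 and serves the model under its repo name.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; vLLM ignores it unless an API key is configured
)

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    max_tokens=512,
    temperature=0.6,
)

print(response.choices[0].message.content)
```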