---
base_model: qwen2.5-vl
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- multimodal
- reasoning
- fine-tuned
- qwen
library_name: transformers
---

# DRIFT

DRIFT (Directional Reasoning Injection for Fine-Tuning) is a fine-tuned version of Qwen2.5-VL with enhanced reasoning capabilities, optimized for multimodal reasoning tasks.

The model is presented in the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050). The code and further details can be found in the GitHub repository: https://github.com/WikiChao/DRIFT

## Usage

The example below uses the `qwen-vl-utils` helper package (`pip install qwen-vl-utils`) for vision preprocessing.

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "ChaoHuangCS/DRIFT-VL-7B"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example usage with an image
image = Image.open("your_image.jpg")
prompt = "Analyze this image and explain your reasoning step by step."

# Format the input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Apply the chat template and prepare the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly generated tokens
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
print(response)
```

A multi-image variant of this example is sketched at the end of this card.

## Fine-tuning Details

This model was fine-tuned with the following setup:

- **Base Model**: Qwen2.5-VL
- **Merged Model**: DeepSeek-R1
- **Training Method**: Custom reasoning-focused fine-tuning
- **Dataset**: Multimodal reasoning datasets
- **Architecture**: Preserves the original Qwen2.5-VL architecture

## Performance

The model is optimized for:

- Enhanced reasoning capabilities
- Better multimodal understanding
- Improved step-by-step thinking
- More accurate visual question answering

## Citation

If you use this model, please cite the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).

## License

This model is released under the MIT license.
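
## Multi-Image Example

Qwen2.5-VL-based models accept more than one image per turn, so DRIFT can be prompted to reason across images. The snippet below is a minimal sketch: it assumes `model` and `processor` are already loaded as in the Usage section above, and the image file names are placeholders.

```python
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info

# Placeholder paths; replace with your own images.
image_a = Image.open("first_image.jpg")
image_b = Image.open("second_image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_a},
            {"type": "image", "image": image_b},
            {"type": "text", "text": "Compare these two images and explain the key differences step by step."},
        ],
    }
]

# Same preprocessing pipeline as the single-image example
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```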