---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
model_name: Qwen3-VL-8B-german-shorthand
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- german
- shorthand
licence: license
---

# Model Card for Qwen3-VL-8B-german-shorthand

This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) for transcribing medieval German shorthand from images. It was trained with [TRL](https://github.com/huggingface/trl) on the [wjbmattingly/german-shorthand-window-5-4](https://huggingface.co/datasets/wjbmattingly/medieval-data-german-shorthand-window-5-4) dataset.

## Model Description

This vision-language model transcribes text from images of German shorthand documents. Given an image of shorthand text, it generates the corresponding transcription.

## Quick start

The example below loads the base model, applies the LoRA adapter with PEFT, and transcribes a single image.

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model, then attach the fine-tuned LoRA adapter
base_model = "Qwen/Qwen3-VL-8B-Instruct"
adapter_model = "wjbmattingly/Qwen3-VL-8B-german-shorthand"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/shorthand_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Tokenize the chat-formatted prompt together with the image
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate the transcription
generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(transcription)
```

## Training procedure

This model was fine-tuned with Supervised Fine-Tuning (SFT) using LoRA adapters on the Qwen3-VL-8B-Instruct base model.

### Training Data

The model was trained on [wjbmattingly/german-shorthand-window-5-4](https://huggingface.co/datasets/wjbmattingly/german-shorthand-window-5-4), a dataset of German shorthand images paired with text transcriptions.

### Training Configuration

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-5
  - Optimizer: AdamW
  - Mixed precision: FP16

### Framework versions

- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Citations

Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```
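
## Derived training quantities

Two quantities follow directly from the values listed under Training Configuration: the LoRA scaling factor and the effective batch size per optimizer step. The sketch below works them out in plain Python. The `TrainingSetup` class and its method names are hypothetical helpers for illustration, not part of the training script. The scaling formula assumes the standard LoRA convention of alpha divided by rank, and the device count is an assumption because the card does not state how many GPUs were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters copied from the Training Configuration list."""

    lora_rank: int = 16
    lora_alpha: int = 32
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4

    def lora_scaling(self) -> float:
        # Assumes standard LoRA scaling: the low-rank update is multiplied by alpha / r.
        return self.lora_alpha / self.lora_rank

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # num_devices is an assumption; the card does not report the GPU count.
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


setup = TrainingSetup()
print(setup.lora_scaling())          # 2.0
print(setup.effective_batch_size())  # 8
```

With a single device, each optimizer step therefore sees 8 examples. The figure scales linearly with the number of devices if training was distributed.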