You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Use of the LLaVA-PLLuM-12b-nc-instruct language model is permitted only by the entities, in the manner, and on the terms specified in Art. 26[2] of the Polish Act of 4 February 1994 on Copyright and Related Rights. The current list of entities referred to in the preceding sentence is:

Cultural heritage institutions,
Higher education institutions,
Federations of entities of the higher education and science system,
The Polish Academy of Sciences (Polska Akademia Nauk),
Scientific institutes of the Polish Academy of Sciences,
Research institutes operating under the Act of 30 April 2010 on Research Institutes,
International scientific institutes established under separate acts and operating in the territory of the Republic of Poland,
Centrum Łukasiewicz (Łukasiewicz Centre),
Institutes operating within the Łukasiewicz Research Network (Sieć Badawcza Łukasiewicz),
Centrum Medycznego Kształcenia Podyplomowego (Centre of Postgraduate Medical Education),
Polska Akademia Umiejętności (Polish Academy of Arts and Sciences),
Other entities that primarily conduct scientific activity in an independent and continuous manner.

We reserve the right to refuse access to the language model if we determine that the entity on whose behalf you are requesting access does not meet the criteria referred to in the provision cited above.
To download the model weights, you must complete the form below. In the form, you provide us with the following data so that we can verify whether access to the language model may be granted: e-mail address, affiliation (the name of the entity on whose behalf you are requesting access), and first and last name. By submitting the form, you consent to the processing of your personal data. Your personal data are processed under the terms set out in the GDPR information clause on this page.

INFORMATION ON THE PROCESSING OF PERSONAL DATA

Pursuant to Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (OJ EU L 119, 4.5.2016, p. 1) (hereinafter also: "GDPR"), we inform you that:

1. the controller of your personal data is Naukowa i Akademicka Sieć Komputerowa - Państwowy Instytut Badawczy (Research and Academic Computer Network - National Research Institute), with its registered office in Warsaw, at ul. Kolska 12, 01-045 Warszawa, whose registration files are kept by the District Court for the Capital City of Warsaw, 13th Commercial Division of the National Court Register, under number 0000012938, REGON: 010464542, NIP: 521-04-17-157 (hereinafter also: "NASK - PIB").
2. NASK - PIB has appointed a data protection officer, who can be contacted by e-mail at iod@nask.pl.
3. the processing of your personal data is necessary to fulfil a legal obligation incumbent on the controller under Art. 26[2](1) of the Act of 4 February 1994 on Copyright and Related Rights, i.e. to verify applicants requesting access to the language models: NASK-PIB/PLLuM-VL-12B-nc-instruct (Art. 6(1)(c) and (f) GDPR), created by the Scientific Consortium established to carry out the project entitled "HIVE AI: Rozwój i pilotażowe wdrożenie dużych modeli językowych w polskiej administracji publicznej" (HIVE AI: Development and Pilot Deployment of Large Language Models in Polish Public Administration) (hereinafter also: the "Project").
4. Your personal data may be disclosed to:
a) the consortium members carrying out the project referred to in point 3, as separate controllers of personal data,
b) the State Treasury (Minister of Digital Affairs), whose office, the Ministry of Digital Affairs, is located in Warsaw at ul. Królewska 27, 00-060 Warszawa, NIP 5252955037, REGON 525189465, as the entity funding the project entitled "HIVE AI: Rozwój i pilotażowe wdrożenie dużych modeli językowych w polskiej administracji publicznej";
c) entities processing personal data on behalf of the controller in connection with the Project, including IT service providers, couriers, and Poczta Polska S.A. for the purpose of handling correspondence, as well as entities providing legal assistance.
5. Your personal data will not be transferred by NASK-PIB to third countries or to international organisations.
6. No decisions concerning your personal data will be made by automated means, in accordance with Art. 22 GDPR.
7. You have the right to access your personal data, the right to request their rectification, the right to restrict processing, the right to erasure, and the right to object; the exercise of these rights with respect to the data referred to in point 3(b) of this clause may be restricted or excluded pursuant to Art. 89(2) GDPR.
8. You have the right to lodge a complaint with the President of the Personal Data Protection Office (Prezes Urzędu Ochrony Danych Osobowych).

LLaVA-PLLuM-12b-nc-instruct

This model is the first Polish-focused Vision-Language Model (VLM), created by extending the open-source LLaVA architecture with the PLLuM language model. Our pipeline integrates high-quality multimodal instruction tuning with PLLuM’s strong Polish linguistic abilities, resulting in a VLM that demonstrates significantly improved understanding of Polish language, culture, and context-specific visual reasoning.

Model Details
Uses
Bias, Risks, and Limitations
Training Details
Evaluation
Environmental Impact
Technical Specifications
Citation
How to Get Started with the Model

Model Details

Model Description

Developed by: NASK PIB
Funded by: NASK PIB
Shared by: NASK PIB
Model type: Multimodal (Image-Text-to-Text) / Visual Language Model
Language(s) (NLP): Polish, English
License: LLaVA-PLLuM-12b-nc-instruct is published under the PLLuM-1.0 license.

Model Sources

Demo: Demo

Uses

Direct Use

The model is intended for research and development purposes, specifically focusing on multimodal tasks requiring the Polish language and cultural context. It can be used directly for:

Visual Question Answering (VQA) in Polish: Users can provide an image and ask questions about it in Polish (e.g., "Co znajduje się na zdjęciu?").
Image Captioning: Generating detailed descriptions of images in grammatically correct Polish.
Optical Character Recognition (OCR): Extracting and interpreting text visible within images, including Polish documents.
Object Counting: Performing simple enumeration of objects within a visual scene.
Multimodal Research: Serving as a baseline or starting point for researchers developing non-English or bilingual Vision-Language Models (VLMs).

Downstream Use

This model can be fine-tuned or integrated into larger applications to support specific use cases, such as:

Accessibility Tools: Creating applications that describe surroundings or digital content to visually impaired Polish speakers.
E-commerce: Generating automated product descriptions based on images for Polish marketplaces.
Educational Assistants: Developing tutoring systems that can explain visual content (diagrams, historical photos) to students in Polish.
Specialized Fine-tuning: The model can be further fine-tuned on domain-specific datasets (e.g., Polish medical imaging reports or legal document analysis) to improve performance in niche sectors.

Out-of-Scope Use

Generation of Harmful Content: Utilizing the model to generate hate speech, explicit content, or to facilitate harassment and disinformation.
High-Stakes Factual Retrieval: Like all Large Language Models, this model can "hallucinate" or produce factually incorrect information. It should not be relied upon as a sole source of truth without human verification.
English-Primary Tasks: While the model retains some English capabilities, it is optimized for Polish. Users seeking state-of-the-art performance strictly for English tasks should prefer models trained primarily on English data.

Bias, Risks, and Limitations

Potential Hallucinations: Like other LLMs, PLLuM may occasionally produce factually incorrect or fabricated content.
Sensitivity & Bias: The current version has not undergone multimodal safety alignment. As a result, users may encounter biased behavior or toxic generations, particularly when the model is prompted with visual inputs.
Context Length: Inputs that approach the 8,192-token training context size, such as long prompts combined with high-resolution images, may degrade output quality or exceed available memory.

Recommendations

Users (both direct and downstream) should be aware of the risks, biases, and limitations of the model. We recommend the following:

Treat as a Research Proof-of-Concept: This model represents a preliminary step toward robust Polish multimodal AI. It is not a finished commercial product. Users should exercise caution when applying it to real-world scenarios and should not deploy it in production environments without extensive domain-specific testing and guardrails.
Human Verification Required: Like all Large Multimodal Models (LMMs), this model is prone to "hallucinations"—confidently stating incorrect facts or describing objects that are not present in the image. Always keep a human in the loop to verify outputs, especially for factual queries or quantitative tasks (e.g., counting objects).
Awareness of Translation Artifacts: A significant portion of the instruction-tuning dataset (e.g., ALLaVA, LLaVA-Instruct) was automatically translated from English to Polish. While we employed filtering metrics (COMET), some linguistic unnaturalness or translation artifacts may persist in the model's responses.

Training Details

Training Data

The model was trained in two stages using a combination of translated open-source datasets and synthetic data, totaling approximately 2 million samples with an 85% Polish / 15% English split.

Stage 1: Pre-training (Feature Alignment)
Stage 2: Instruction Tuning (Visual Instruction Tuning)

Training Procedure

Preprocessing

To create high-quality Polish multimodal data from English sources, a rigorous translation and filtering pipeline was employed:

Translation: Source English datasets were translated using the Tower+ 72B model.
Filtering: The COMET reference-free metric was used to filter out poor-quality translations.
Manual Review: A portion of the data underwent manual expert filtering to ensure linguistic quality.
Dynamic Tiling: Following LLaVA-NeXT, images are processed with dynamic tiling to support higher input resolutions.
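
The dynamic tiling step can be illustrated with a minimal sketch of LLaVA-NeXT-style grid selection: for each candidate canvas, the image is scaled to fit, and the canvas that preserves the most image pixels with the least padding is chosen. The 384-pixel tile size comes from the vision encoder listed in this card; the candidate list and the function names are hypothetical, and the grids actually used by this model may differ.

```python
# Sketch of LLaVA-NeXT-style grid selection for dynamic tiling.
TILE = 384  # vision encoder input size stated in this card

# Hypothetical candidate canvases as (width, height); the real list may differ.
CANDIDATES = [(384, 768), (768, 384), (768, 768), (1152, 384), (384, 1152)]


def select_best_grid(image_size, candidates=CANDIDATES):
    """Pick the canvas that keeps the most image pixels and wastes the least area."""
    w, h = image_size
    best, best_eff, best_waste = None, -1, None
    for cw, ch in candidates:
        # Scale the image to fit inside the canvas while keeping its aspect ratio.
        scale = min(cw / w, ch / h)
        dw, dh = int(w * scale), int(h * scale)
        # Effective resolution never exceeds the original pixel count.
        eff = min(dw * dh, w * h)
        waste = cw * ch - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (cw, ch), eff, waste
    return best


def tile_count(grid, tile=TILE):
    """Number of encoder-sized tiles that cover the chosen canvas."""
    cw, ch = grid
    return (cw // tile) * (ch // tile)


grid = select_best_grid((1536, 768))  # a wide 2:1 image
print(grid, tile_count(grid))
```

In LLaVA-NeXT a downscaled copy of the whole image is typically encoded alongside the tiles, so the number of encoder passes is usually one more than the tile count.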

Speeds, Sizes, Times

Training Stages: 2 Stages.
Epochs: 1 Epoch for both stages.
Batch Size: 256 (Stage 1), 128 (Stage 2).
Context Size: 8,192 tokens.
Trainable Parameters:
- Stage 1: 30M (Projector only).
- Stage 2: 12B (LLM via LoRA) + 400M (Vision Encoder) + 30M (Projector).
Learning Rates (Stage 2): 2x10⁻⁶ (Vision), 2x10⁻⁵ (Projector & LLM).
LoRA Config: Rank 128, Alpha 256, Dropout 0.05.
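
To make the LoRA settings above concrete, the sketch below computes the standard LoRA scaling factor (alpha divided by rank) and the number of adapter parameters added to a single weight matrix. The 5120 x 5120 matrix shape is an illustrative assumption and is not taken from this card; the helper names are hypothetical.

```python
# Arithmetic sketch for the LoRA settings listed above (rank 128, alpha 256).

def lora_scaling(rank, alpha):
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return alpha / rank


def lora_params(d_in, d_out, rank):
    """Adapter parameters for one matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 128, 256        # from the card
D_IN = D_OUT = 5120           # assumed projection shape, for illustration only

print("scaling:", lora_scaling(RANK, ALPHA))
print("adapter params per matrix:", lora_params(D_IN, D_OUT, RANK))
print("full matrix params:", D_IN * D_OUT)
```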

Evaluation

Testing Data, Factors & Metrics

Testing Data

Quantitative: MMBench v1.1 (Development Split). The dataset was translated to Polish using Tower+ 72B and subsequently manually corrected by experts to remove translation artifacts (referred to as MMBench-PL).
Qualitative (Model-as-a-Judge): XM3600 (Polish subset), a dataset requiring accurate and culturally relevant image descriptions.

Factors

Language: Performance comparison between Polish (target) and English (source) capabilities.
Task Type: Object recognition, OCR, commonsense reasoning, fine-grained perception, and cultural context recognition.

Metrics

Accuracy: Used for MMBench multiple-choice questions.
Win-rate (LLM-as-a-Judge): Pairwise comparison using LLaVA-OneVision-72B to judge caption quality between PLLuM and baseline models (PaliGemma, Qwen2.5, Pixtral).
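
Both metrics reduce to simple ratios, as the following minimal sketch shows. The label strings and the handling of ties (counted as non-wins) are assumptions made for illustration; the card does not say how ties were treated.

```python
# Sketch of the two evaluation metrics named above.

def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def win_rate(judgments, winner_label="model"):
    """Fraction of pairwise comparisons the judge decided in favour of the model.

    Assumption: any other label (for example "baseline" or "tie") is a non-win.
    """
    wins = sum(j == winner_label for j in judgments)
    return wins / len(judgments)


print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
print(win_rate(["model", "model", "baseline", "tie"]))
```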

Results

Summary

The model demonstrates a significant advancement in Polish multimodal capabilities:

MMBench-PL: Achieved 73.89%, marking a +5.6% improvement over LLaVA-1.6-Vicuna-13B, while maintaining comparable English performance.
Captioning Quality: The model's captions were consistently preferred by the LLM judge over those of open-source competitors (95.2% win-rate vs. PaliGemma-3B, 62.7% vs. Qwen2.5-VL-7B).
Qualitative Analysis: The model shows superior handling of Polish grammar/morphology and correctly identifies Polish cultural elements (e.g., specific landmarks like the Palace of Culture and Science, regional food like Toruń gingerbread) where generic models often fail.

Societal Impact Assessment

Cultural Inclusion: This model helps bridge the gap in multimodal AI for the Polish language, allowing for technology that reflects local linguistic and cultural nuances rather than defaulting to US-centric norms.
Lack of Safety Alignment: Important: As a research proof-of-concept, this model has not undergone specific safety alignment (e.g., RLHF) for the vision-language domain. Consequently, it may be more prone to generating biased, toxic, or inappropriate responses compared to fully commercialized models, especially when prompted with controversial visual content.
Reliability: Users should be aware of the potential for hallucinations, particularly in OCR or counting tasks, and should not use the model for high-stakes decision-making.

Technical Specifications

Model Architecture and Objective

Architecture: Based on the LLaVA-NeXT framework.
Language Model: PLLuM-12B-nc-instruct (Polish-native, instruction-tuned).
Vision Encoder: SigLIP2 So400m/14, 384px (Chosen for strong multilingual alignment).
Connector: Two-layer MLP projector.
Objective: The model uses a standard autoregressive language modeling objective, conditioned on visual inputs processed through the encoder and projector.
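
As a rough sanity check on the connector, the sketch below counts the parameters of a two-layer MLP projector laid out as in LLaVA (vision width to LLM width, then LLM width to LLM width, with biases). The hidden sizes are assumptions that this card does not state: 1152 for the SigLIP2 So400m encoder and 5120 for the 12B language model. Under those assumptions the count comes out near the roughly 30M projector parameters quoted in the training details.

```python
# Parameter-count sketch for a two-layer MLP projector (LLaVA-style layout assumed).

def linear_params(d_in, d_out, bias=True):
    """Weights plus optional bias of a single linear layer."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(vision_dim, llm_dim):
    """Linear(vision_dim -> llm_dim), activation, Linear(llm_dim -> llm_dim)."""
    return linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)


VISION_DIM = 1152  # assumed hidden size of the vision encoder
LLM_DIM = 5120     # assumed hidden size of the language model

total = mlp_projector_params(VISION_DIM, LLM_DIM)
print(f"{total / 1e6:.1f}M projector parameters")
```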

Compute Infrastructure

Hardware

We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computing facilities and support within computational grant no. PLG/2025/018129.

Citation

Model Card Contact

For questions or contributions, please reach out via: nlp@nask.pl

How to Get Started with the Model

Inference Example using Transformers

Use the code below to run the model. We recommend using transformers >= 4.56.2.

import torch
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, 
    dtype=torch.float16, 
    device_map="auto",
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    images=image, 
    text=prompt, 
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)

input_len = inputs.input_ids.shape[1]
generated_ids = output[0][input_len:]
print(processor.decode(generated_ids, skip_special_tokens=True))

Inference with vLLM

You can also run the model with vLLM, as in the example below. We recommend using vllm >= 0.10.0.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from PIL import Image
import requests


model_id = "NASK-PIB/LLaVA-PLLuM-12b-nc-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": "<image>\nOpisz ten obrazek w szczegółach."  # "Describe this image in detail"
    },
]
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.2, max_tokens=256)
output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params
)

print(output[0].outputs[0].text)

Model size: 13B params (Safetensors)
Tensor type: BF16
Base model: CYFRAGOVPL/pllum-12b-nc-instruct-250715
