ColNetraEmbed
ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone and using ColBERT-style multi-vector representations.
Model Description
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).
- Model Type: Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
- Architecture: ColPali with Gemma3-4B backbone
- Embedding Dimension: 128 per token
- Capabilities: Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- Use Case: Visual document retrieval, multilingual document understanding, fine-grained visual search
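To make the late-interaction (MaxSim) scoring described above concrete, here is a minimal sketch in plain PyTorch. The tensor shapes are illustrative assumptions, not the model's actual token or patch counts.

import torch

# Illustrative shapes (assumptions): a query with 16 token embeddings and
# a document page with 256 patch embeddings, 128 dimensions each.
query_emb = torch.randn(16, 128)
doc_emb = torch.randn(256, 128)

# MaxSim: for each query token, take the similarity of its best-matching
# patch, then sum those maxima over all query tokens.
token_patch_sim = query_emb @ doc_emb.T          # (16, 256)
score = token_patch_sim.max(dim=1).values.sum()  # scalar late-interaction score

Because each query token keeps only its best-matching patch, the score can be traced back to specific regions of the page, which is the source of the interpretability noted in the comparison section below.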
Paper
M3DR: Towards Universal Multilingual Multimodal Document Retrieval (arXiv:2512.03514)
Installation
pip install git+https://github.com/adithya-s-k/colpali.git
Quick Start
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
Use Cases
- Document Retrieval: Search through large collections of visual documents
- Visual Question Answering: Answer questions about document content
- Document Understanding: Extract and match information from scanned documents
- Cross-lingual Document Search: Multilingual visual document retrieval
Model Details
- Base Model: Gemma3-4B-IT
- Vision Encoder: SigLIP
- Training Data: Multilingual document datasets
- Embedding Strategy: Multi-vector (Late Interaction)
- Similarity Function: MaxSim (Maximum Similarity)
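Written out, MaxSim is the standard ColBERT late-interaction sum: with query token embeddings $\mathbf{E}_q^{(i)}$ and document patch embeddings $\mathbf{E}_d^{(j)}$,

$$\text{score}(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathbf{E}_q^{(i)} \cdot \mathbf{E}_d^{(j)}$$

Each query token contributes the similarity of its single best-matching patch, so a strong local match is not washed out by pooling over the whole page.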
Performance
ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. It was evaluated on Nayana-IR Bench (22 languages) and on ViDoRe v2.
Benchmark Results
Nayana-IR Cross-Lingual
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.637 | 0.700 | 0.610 | 0.610 |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |
Nayana-IR Monolingual
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.670 | 0.764 | 0.645 | 0.686 |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |
ViDoRe v2
| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| ColNetraEmbed | 0.551 | 0.664 | 0.445 | 0.445 |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |
Key Results:
- Strong multilingual performance with ColBERT-style late interaction
- 124% improvement over ColPali-v1.3 on cross-lingual tasks
- Supports 22 languages across diverse script families
- Fine-grained matching through token-level MaxSim scoring
Comparison: Multi-vector vs Single-vector
- ColNetraEmbed (multi-vector): more interpretable, with token-level attribution for matches
- NetraEmbed (single-vector): higher accuracy (0.716 vs. 0.637) and roughly 250× smaller storage footprint
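The storage gap follows from the representation sizes rather than any compression trick: a multi-vector index keeps one 128-dimensional vector per token or patch of every page, while a single-vector index keeps one pooled vector per page. As a back-of-the-envelope rule (the specific token counts and dimensions behind the reported ~250× figure are in the paper, not stated here):

$$\text{storage ratio} \approx \frac{n_{\text{tokens}} \cdot d_{\text{multi}}}{d_{\text{single}}}$$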
See our paper for comprehensive evaluation and architectural comparisons.
Citation
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
License
This model is released under the same license as the base Gemma3 model.
Acknowledgments
Compute credits for training, inference, and evaluation were provided by Modal, our compute sponsor. Dataset curation and synthesis were supported by the Meta LLaMA Impact Grant through our Nayana initiative; we thank Meta for their continued support of our research at CognitiveLab.
Built on top of the ColPali framework and Gemma3 architecture.