jupyterjazz committed
Commit d406962 · verified · 1 parent: 7c77aab

codebase-refinement (#25)


- refactor: multi-vec, st truncation, etc (12eb79618f3d3165c474e9b1b5be720b10349006)
- docs: title and citation (fae02730f619b5992955527b6afb73bd3355099f)
- docs: update vdr info (a8a6bf27c82f2cae94c3c6cb8d04b210329eccf2)
- chore: vdr link (54a9b60b24c88057e9c57cbc92c77f1ab075c3b6)
- docs: vdr phrasing (2273b714a7de182d9e30d00b8cc1b3b0fc350480)

Files changed (6)
  1. README.md +18 -17
  2. config.json +1 -1
  3. custom_st.py +5 -5
  4. modeling_jina_embeddings_v4.py +5 -7
  5. modules.json +1 -1
  6. vidore_eval.md +0 -26
README.md CHANGED
@@ -10,13 +10,13 @@
  </p>

  <p align="center">
- <b>Jina Embeddings v4: Multilingual Multimodal Embeddings</b>
+ <b>Jina Embeddings v4: Universal Embeddings for Multimodal Multilingual Retrieval</b>
  </p>


  ## Quick Start

- [Blog](https://alwaysjudgeabookbyitscover.com/) | [Technical Report](https://arxiv.org/abs/2506.18902) | [API](https://jina.ai/embeddings)
+ [Blog](https://jina.ai/news/) | [Technical Report](https://arxiv.org/abs/2506.18902) | [API](https://jina.ai/embeddings)


  ## Intended Usage & Model Info
@@ -303,25 +303,15 @@ code_embeddings = model.encode(
  # ========================
  # 4. Use multivectors
  # ========================
-
- multivector_text_embeddings = model.encode(
-     sentences=texts,
-     task="retrieval",
-     prompt_name="query",
-     return_multivector=True,
- )
-
- images = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"]
-
- multivector_image_embeddings = model.encode(
-     sentences=images,
-     task="retrieval",
-     return_multivector=True,
- )
+ # If you want to use multi-vector embeddings, please use the Hugging Face model directly.
  ```
  </details>


+ ## Jina-VDR
+ Alongside `jina-embeddings-v4`, we’re releasing [Jina VDR](https://github.com/jina-ai/jina-vdr), a multilingual, multi-domain benchmark for visual document retrieval. The task collection can be viewed [here](https://huggingface.co/collections/jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449), and evaluation instructions can be found [here](https://github.com/jina-ai/jina-vdr).
+
+
  ## License

  This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://longdogechallenge.com/), [Azure](https://longdogechallenge.com/), and [GCP](https://longdogechallenge.com/). To download for commercial use, please [contact us](https://jina.ai/contact-sales).
@@ -335,3 +325,14 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
  ## Citation

  If you find `jina-embeddings-v4` useful in your research, please cite the following paper:
+ ```
+ @misc{günther2025jinaembeddingsv4universalembeddingsmultimodal,
+       title={jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval},
+       author={Michael Günther and Saba Sturua and Mohammad Kalim Akram and Isabelle Mohr and Andrei Ungureanu and Sedigheh Eslami and Scott Martens and Bo Wang and Nan Wang and Han Xiao},
+       year={2025},
+       eprint={2506.18902},
+       archivePrefix={arXiv},
+       primaryClass={cs.AI},
+       url={https://arxiv.org/abs/2506.18902},
+ }
+ ```
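For readers following the removed Sentence Transformers snippet: the multi-vector path is now documented only for the Hugging Face model. A minimal sketch of what that could look like, assuming the repository's custom `encode_text` helper and its `return_multivector` flag (those names are not shown in this diff, so treat them as assumptions rather than the confirmed API):

```python
# Sketch only: multi-vector retrieval through the Hugging Face model
# (AutoModel + trust_remote_code). `encode_text` and its arguments are
# assumptions based on the removed README example, not part of this diff.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

texts = ["A beautiful sunset over the beach"]

# Each input is expected to yield a set of per-token vectors instead of one pooled vector.
multivector_text_embeddings = model.encode_text(
    texts=texts,
    task="retrieval",
    prompt_name="query",
    return_multivector=True,
)
print(len(multivector_text_embeddings), multivector_text_embeddings[0].shape)
```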
config.json CHANGED
@@ -55,6 +55,6 @@
  "vocab_size": 151936,
  "truncate_dim": null,
  "task_names": ["retrieval", "text-matching", "code"],
- "matryoshka_dims": [128, 256, 512, 1024],
+ "matryoshka_dims": [128, 256, 512, 1024, 2048],
  "_attn_implementation": "flash_attention_2"
  }
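The only change here is appending 2048 to `matryoshka_dims`, which appears to record that the full single-vector size is itself a valid target. A back-of-the-envelope sketch of what the dimension choice means for storage (pure arithmetic, not repository code; byte sizes assume plain float16/float32 storage):

```python
# Rough per-vector storage cost for each matryoshka dimension in the updated config.
matryoshka_dims = [128, 256, 512, 1024, 2048]

for dim in matryoshka_dims:
    print(f"dim={dim:5d}  float16: {dim * 2 / 1024:5.1f} KiB  float32: {dim * 4 / 1024:5.1f} KiB")
```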
custom_st.py CHANGED
@@ -103,7 +103,7 @@ class Transformer(nn.Module):
          return encoding

      def forward(
-         self, features: Dict[str, torch.Tensor], task: Optional[str] = None
+         self, features: Dict[str, torch.Tensor], task: Optional[str] = None, truncate_dim: Optional[int] = None
      ) -> Dict[str, torch.Tensor]:
          self.model.eval()

@@ -136,8 +136,8 @@ class Transformer(nn.Module):
          text_embeddings = self.model(
              **text_batch, task_label=task
          ).single_vec_emb
-         if self.config.truncate_dim:
-             text_embeddings = text_embeddings[:, : self.config.truncate_dim]
+         if truncate_dim:
+             text_embeddings = text_embeddings[:, : truncate_dim]
          text_embeddings = torch.nn.functional.normalize(text_embeddings, p=2, dim=-1)
          for i, embedding in enumerate(text_embeddings):
              all_embeddings.append((text_indices[i], embedding))
@@ -154,8 +154,8 @@ class Transformer(nn.Module):
          img_embeddings = self.model(
              **image_batch, task_label=task
          ).single_vec_emb
-         if self.config.truncate_dim:
-             img_embeddings = img_embeddings[:, : self.config.truncate_dim]
+         if truncate_dim:
+             img_embeddings = img_embeddings[:, : truncate_dim]
          img_embeddings = torch.nn.functional.normalize(img_embeddings, p=2, dim=-1)

          for i, embedding in enumerate(img_embeddings):
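The net effect of this change is that truncation becomes a per-call argument of `Transformer.forward` rather than a fixed `config.truncate_dim`. A standalone sketch of the same branch in isolation (not repository code), showing that slicing happens before L2 normalization, so truncated vectors remain unit-norm:

```python
# Minimal sketch of the truncate-then-normalize logic added to custom_st.Transformer.forward.
# truncate_dim=None keeps the full vector, matching the old behaviour when config.truncate_dim was null.
from typing import Optional

import torch


def truncate_and_normalize(emb: torch.Tensor, truncate_dim: Optional[int] = None) -> torch.Tensor:
    if truncate_dim:
        emb = emb[:, :truncate_dim]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)


full = torch.randn(2, 2048)
print(truncate_and_normalize(full).shape)       # torch.Size([2, 2048])
print(truncate_and_normalize(full, 128).shape)  # torch.Size([2, 128])
```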
modeling_jina_embeddings_v4.py CHANGED
@@ -127,13 +127,11 @@ class JinaEmbeddingsV4ModelOutput:
          vlm_last_hidden_states (torch.Tensor, optional): Last hidden states of the VLM.
          single_vec_emb (torch.Tensor, optional): Single-vector embeddings.
          multi_vec_emb (torch.Tensor, optional): Multi-vector embeddings.
-         attention_mask (torch.Tensor, optional): Attention mask.
      """

      vlm_last_hidden_states: Optional[torch.Tensor] = None
      single_vec_emb: Optional[torch.Tensor] = None
      multi_vec_emb: Optional[torch.Tensor] = None
-     attention_mask: Optional[torch.Tensor] = None


  class JinaEmbeddingsV4Model(Qwen2_5_VLForConditionalGeneration):
@@ -314,7 +312,6 @@ class JinaEmbeddingsV4Model(Qwen2_5_VLForConditionalGeneration):
              ),
              single_vec_emb=single_vec_emb,
              multi_vec_emb=multi_vec_emb,
-             attention_mask=attention_mask,
          )

      def _process_batches(
@@ -345,17 +342,18 @@
              device_type=torch.device(self.device).type, dtype=torch.bfloat16
          ):
              embeddings = self(**batch, task_label=task_label)
-             attention_mask = embeddings.attention_mask
              if not return_multivector:
                  embeddings = embeddings.single_vec_emb
                  if truncate_dim is not None:
                      embeddings = embeddings[:, :truncate_dim]
-                 embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
              else:
                  embeddings = embeddings.multi_vec_emb
+
              if return_multivector and not return_numpy:
-                 valid_tokens = attention_mask.bool()
-                 embeddings = [emb[mask] for emb, mask in zip(embeddings, valid_tokens)]
+                 valid_tokens = batch["attention_mask"].bool()
+                 embeddings = [
+                     emb[mask] for emb, mask in zip(embeddings, valid_tokens)
+                 ]
                  results.append(embeddings)
              else:
                  results.append(
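The multi-vector branch now takes the attention mask straight from the input batch instead of threading it through the model output. A small sketch of what that filtering does (shapes and mask values are illustrative, not taken from the model):

```python
# Sketch of the updated multi-vector branch in _process_batches: padding positions
# are dropped from the per-token embeddings using the batch's own attention mask.
import torch

batch_size, seq_len, dim = 2, 6, 8
multi_vec_emb = torch.randn(batch_size, seq_len, dim)

attention_mask = torch.tensor([  # 1 = real token, 0 = padding
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
])

valid_tokens = attention_mask.bool()
# One variable-length tensor per input: only embeddings of real tokens survive.
embeddings = [emb[mask] for emb, mask in zip(multi_vec_emb, valid_tokens)]
print([tuple(e.shape) for e in embeddings])  # [(4, 8), (3, 8)]
```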
modules.json CHANGED
@@ -4,6 +4,6 @@
  "name": "transformer",
  "path": "",
  "type": "custom_st.Transformer",
- "kwargs": ["task"]
+ "kwargs": ["task", "truncate_dim"]
  }
  ]
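With `truncate_dim` added to the module's forwarded kwargs, a Sentence Transformers caller should be able to request truncation per `encode` call, the same way `task` is already forwarded. A hedged usage sketch (the kwarg forwarding depends on a sufficiently recent sentence-transformers release; this snippet is not taken from the repository README):

```python
# Sketch: passing the forwarded kwargs ("task", "truncate_dim") through SentenceTransformer.encode.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v4", trust_remote_code=True)

texts = ["A beautiful sunset over the beach"]

# Full-size single vectors (no truncation requested).
emb_full = model.encode(texts, task="retrieval", prompt_name="query")

# 128-dimensional vectors: truncated, then re-normalized inside custom_st.Transformer.
emb_128 = model.encode(texts, task="retrieval", prompt_name="query", truncate_dim=128)

print(emb_full.shape, emb_128.shape)
```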
vidore_eval.md DELETED
@@ -1,26 +0,0 @@
- # How to run the Vidore Evaluation
-
- If you want to run the Vidore evaluation on the jina-embeddings-v4 model (and on the Document Retrieval Benchmark curated by Jina AI), you need to install the requirements in [this fork/branch](https://github.com/jina-ai/vidore-benchmark-fork/tree/feat-add-jina-embeddings) (these changes should be merged into the Vidore source code soon).
-
- ```
- pip install vidore-benchmark[jina-v4]
- ```
-
- You can run the evaluation with the following command:
-
- ```
- vidore-benchmark evaluate-retriever \
-     --model-class jev4 \
-     --model-name jinaai/jina-embeddings-v4 \
-     --collection-name jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449 \
-     --dataset-format qa \
-     --split test
- ```
-
- ## Evaluate Pure Text Retrieval Models on Refined Vidore Tasks
-
- The original Vidore datasets contain multiple text chunks per image so that text retrieval models can be evaluated on them.
- Those text chunks are extracted from the document pages using different tools such as [Unstructured](https://github.com/Unstructured-IO/unstructured), OCR models, and LLMs.
- For evaluating text retrieval models on our filtered versions of the Vidore datasets, you can use the datasets in the collection `https://huggingface.co/collections/jinaai/jina-vdr-vidoreocr-tasks-6852cfc55ccf837e7fecfa1b`.
-
- It is also possible to evaluate jina-embeddings-v4 and other vision retrieval models on them. This, however, takes more time and should lead to the same evaluation results as running the versions of the datasets in the Jina VDR collection.