jupyterjazz committed
Commit d406962 · verified · 1 parent: 7c77aab

codebase-refinement (#25)


- refactor: multi-vec, st truncation, etc (12eb79618f3d3165c474e9b1b5be720b10349006)
- docs: title and citation (fae02730f619b5992955527b6afb73bd3355099f)
- docs: update vdr info (a8a6bf27c82f2cae94c3c6cb8d04b210329eccf2)
- chore: vdr link (54a9b60b24c88057e9c57cbc92c77f1ab075c3b6)
- docs: vdr phrasing (2273b714a7de182d9e30d00b8cc1b3b0fc350480)

Files changed (6)
  1. README.md +18 -17
  2. config.json +1 -1
  3. custom_st.py +5 -5
  4. modeling_jina_embeddings_v4.py +5 -7
  5. modules.json +1 -1
  6. vidore_eval.md +0 -26
README.md CHANGED
@@ -10,13 +10,13 @@
  </p>

  <p align="center">
- <b>Jina Embeddings v4: Multilingual Multimodal Embeddings</b>
+ <b>Jina Embeddings v4: Universal Embeddings for Multimodal Multilingual Retrieval</b>
  </p>


  ## Quick Start

- [Blog](https://alwaysjudgeabookbyitscover.com/) | [Technical Report](https://arxiv.org/abs/2506.18902) | [API](https://jina.ai/embeddings)
+ [Blog](https://jina.ai/news/) | [Technical Report](https://arxiv.org/abs/2506.18902) | [API](https://jina.ai/embeddings)


  ## Intended Usage & Model Info
@@ -303,25 +303,15 @@ code_embeddings = model.encode(
  # ========================
  # 4. Use multivectors
  # ========================
-
- multivector_text_embeddings = model.encode(
-     sentences=texts,
-     task="retrieval",
-     prompt_name="query",
-     return_multivector=True,
- )
-
- images = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"]
-
- multivector_image_embeddings = model.encode(
-     sentences=images,
-     task="retrieval",
-     return_multivector=True,
- )
+ # If you want to use multi-vector embeddings, please use the Hugging Face model directly.
  ```
  </details>


+ ## Jina-VDR
+ Alongside `jina-embeddings-v4`, we’re releasing [Jina VDR](https://github.com/jina-ai/jina-vdr), a multilingual, multi-domain benchmark for visual document retrieval. The task collection can be viewed [here](https://huggingface.co/collections/jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449), and evaluation instructions can be found [here](https://github.com/jina-ai/jina-vdr).
+
+
  ## License

  This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://longdogechallenge.com/), [Azure](https://longdogechallenge.com/), and [GCP](https://longdogechallenge.com/). To download for commercial use, please [contact us](https://jina.ai/contact-sales).
@@ -335,3 +325,14 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
  ## Citation

  If you find `jina-embeddings-v4` useful in your research, please cite the following paper:
+ ```
+ @misc{günther2025jinaembeddingsv4universalembeddingsmultimodal,
+       title={jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval},
+       author={Michael Günther and Saba Sturua and Mohammad Kalim Akram and Isabelle Mohr and Andrei Ungureanu and Sedigheh Eslami and Scott Martens and Bo Wang and Nan Wang and Han Xiao},
+       year={2025},
+       eprint={2506.18902},
+       archivePrefix={arXiv},
+       primaryClass={cs.AI},
+       url={https://arxiv.org/abs/2506.18902},
+ }
+ ```
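For readers following the removed Sentence Transformers snippet: the multi-vector path is now documented only for the Hugging Face model. A minimal sketch of what that could look like, assuming the repository's custom `encode_text` helper and its `return_multivector` flag (those names are not shown in this diff, so treat them as assumptions rather than the confirmed API):

```python
# Sketch only: multi-vector retrieval through the Hugging Face model
# (AutoModel + trust_remote_code). `encode_text` and its arguments are
# assumptions based on the removed README example, not part of this diff.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

texts = ["A beautiful sunset over the beach"]

# Each input is expected to yield a set of per-token vectors instead of one pooled vector.
multivector_text_embeddings = model.encode_text(
    texts=texts,
    task="retrieval",
    prompt_name="query",
    return_multivector=True,
)
print(len(multivector_text_embeddings), multivector_text_embeddings[0].shape)
```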
config.json CHANGED
@@ -55,6 +55,6 @@
  "vocab_size": 151936,
  "truncate_dim": null,
  "task_names": ["retrieval", "text-matching", "code"],
- "matryoshka_dims": [128, 256, 512, 1024],
+ "matryoshka_dims": [128, 256, 512, 1024, 2048],
  "_attn_implementation": "flash_attention_2"
  }
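The only change here is appending 2048 to `matryoshka_dims`, which appears to record that the full single-vector size is itself a valid target. A back-of-the-envelope sketch of what the dimension choice means for storage (pure arithmetic, not repository code; byte sizes assume plain float16/float32 storage):

```python
# Rough per-vector storage cost for each matryoshka dimension in the updated config.
matryoshka_dims = [128, 256, 512, 1024, 2048]

for dim in matryoshka_dims:
    print(f"dim={dim:5d}  float16: {dim * 2 / 1024:5.1f} KiB  float32: {dim * 4 / 1024:5.1f} KiB")
```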
custom_st.py CHANGED
@@ -103,7 +103,7 @@ class Transformer(nn.Module):
          return encoding

      def forward(
-         self, features: Dict[str, torch.Tensor], task: Optional[str] = None
+         self, features: Dict[str, torch.Tensor], task: Optional[str] = None, truncate_dim: Optional[int] = None
      ) -> Dict[str, torch.Tensor]:
          self.model.eval()

@@ -136,8 +136,8 @@ class Transformer(nn.Module):
          text_embeddings = self.model(
              **text_batch, task_label=task
          ).single_vec_emb
-         if self.config.truncate_dim:
-             text_embeddings = text_embeddings[:, : self.config.truncate_dim]
+         if truncate_dim:
+             text_embeddings = text_embeddings[:, : truncate_dim]
          text_embeddings = torch.nn.functional.normalize(text_embeddings, p=2, dim=-1)
          for i, embedding in enumerate(text_embeddings):
              all_embeddings.append((text_indices[i], embedding))
@@ -154,8 +154,8 @@ class Transformer(nn.Module):
          img_embeddings = self.model(
              **image_batch, task_label=task
          ).single_vec_emb
-         if self.config.truncate_dim:
-             img_embeddings = img_embeddings[:, : self.config.truncate_dim]
+         if truncate_dim:
+             img_embeddings = img_embeddings[:, : truncate_dim]
          img_embeddings = torch.nn.functional.normalize(img_embeddings, p=2, dim=-1)

          for i, embedding in enumerate(img_embeddings):
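The net effect of this change is that truncation becomes a per-call argument of `Transformer.forward` rather than a fixed `config.truncate_dim`. A standalone sketch of the same branch in isolation (not repository code), showing that slicing happens before L2 normalization, so truncated vectors remain unit-norm:

```python
# Minimal sketch of the truncate-then-normalize logic added to custom_st.Transformer.forward.
# truncate_dim=None keeps the full vector, matching the old behaviour when config.truncate_dim was null.
from typing import Optional

import torch


def truncate_and_normalize(emb: torch.Tensor, truncate_dim: Optional[int] = None) -> torch.Tensor:
    if truncate_dim:
        emb = emb[:, :truncate_dim]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)


full = torch.randn(2, 2048)
print(truncate_and_normalize(full).shape)       # torch.Size([2, 2048])
print(truncate_and_normalize(full, 128).shape)  # torch.Size([2, 128])
```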
modeling_jina_embeddings_v4.py CHANGED
@@ -127,13 +127,11 @@ class JinaEmbeddingsV4ModelOutput:
          vlm_last_hidden_states (torch.Tensor, optional): Last hidden states of the VLM.
          single_vec_emb (torch.Tensor, optional): Single-vector embeddings.
          multi_vec_emb (torch.Tensor, optional): Multi-vector embeddings.
-         attention_mask (torch.Tensor, optional): Attention mask.
      """

      vlm_last_hidden_states: Optional[torch.Tensor] = None
      single_vec_emb: Optional[torch.Tensor] = None
      multi_vec_emb: Optional[torch.Tensor] = None
-     attention_mask: Optional[torch.Tensor] = None


  class JinaEmbeddingsV4Model(Qwen2_5_VLForConditionalGeneration):
@@ -314,7 +312,6 @@ class JinaEmbeddingsV4Model(Qwen2_5_VLForConditionalGeneration):
              ),
              single_vec_emb=single_vec_emb,
              multi_vec_emb=multi_vec_emb,
-             attention_mask=attention_mask,
          )

      def _process_batches(
@@ -345,17 +342,18 @@
              device_type=torch.device(self.device).type, dtype=torch.bfloat16
          ):
              embeddings = self(**batch, task_label=task_label)
-             attention_mask = embeddings.attention_mask
              if not return_multivector:
                  embeddings = embeddings.single_vec_emb
                  if truncate_dim is not None:
                      embeddings = embeddings[:, :truncate_dim]
-                 embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=-1)
              else:
                  embeddings = embeddings.multi_vec_emb
+
              if return_multivector and not return_numpy:
-                 valid_tokens = attention_mask.bool()
-                 embeddings = [emb[mask] for emb, mask in zip(embeddings, valid_tokens)]
+                 valid_tokens = batch["attention_mask"].bool()
+                 embeddings = [
+                     emb[mask] for emb, mask in zip(embeddings, valid_tokens)
+                 ]
                  results.append(embeddings)
              else:
                  results.append(
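The multi-vector branch now takes the attention mask straight from the input batch instead of threading it through the model output. A small sketch of what that filtering does (shapes and mask values are illustrative, not taken from the model):

```python
# Sketch of the updated multi-vector branch in _process_batches: padding positions
# are dropped from the per-token embeddings using the batch's own attention mask.
import torch

batch_size, seq_len, dim = 2, 6, 8
multi_vec_emb = torch.randn(batch_size, seq_len, dim)

attention_mask = torch.tensor([  # 1 = real token, 0 = padding
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
])

valid_tokens = attention_mask.bool()
# One variable-length tensor per input: only embeddings of real tokens survive.
embeddings = [emb[mask] for emb, mask in zip(multi_vec_emb, valid_tokens)]
print([tuple(e.shape) for e in embeddings])  # [(4, 8), (3, 8)]
```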
modules.json CHANGED
@@ -4,6 +4,6 @@
  "name": "transformer",
  "path": "",
  "type": "custom_st.Transformer",
- "kwargs": ["task"]
+ "kwargs": ["task", "truncate_dim"]
  }
  ]
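With `truncate_dim` added to the module's forwarded kwargs, a Sentence Transformers caller should be able to request truncation per `encode` call, the same way `task` is already forwarded. A hedged usage sketch (the kwarg forwarding depends on a sufficiently recent sentence-transformers release; this snippet is not taken from the repository README):

```python
# Sketch: passing the forwarded kwargs ("task", "truncate_dim") through SentenceTransformer.encode.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v4", trust_remote_code=True)

texts = ["A beautiful sunset over the beach"]

# Full-size single vectors (no truncation requested).
emb_full = model.encode(texts, task="retrieval", prompt_name="query")

# 128-dimensional vectors: truncated, then re-normalized inside custom_st.Transformer.
emb_128 = model.encode(texts, task="retrieval", prompt_name="query", truncate_dim=128)

print(emb_full.shape, emb_128.shape)
```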
vidore_eval.md DELETED
@@ -1,26 +0,0 @@
- # How to run the Vidore Evaluation
-
- If you want to run the Vidore evaluation on the jina-embeddings-v4 model (and on the Document Retrieval Benchmark curated by Jina AI), you need to install the requirements in [this fork/branch](https://github.com/jina-ai/vidore-benchmark-fork/tree/feat-add-jina-embeddings) (these changes should be merged into the Vidore source code soon).
-
- ```
- pip install vidore-benchmark[jina-v4]
- ```
-
- You can run the evaluation with the following command:
-
- ```
- vidore-benchmark evaluate-retriever \
-     --model-class jev4 \
-     --model-name jinaai/jina-embeddings-v4 \
-     --collection-name jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449 \
-     --dataset-format qa \
-     --split test
- ```
-
- ## Evaluate Pure Text Retrieval Models on Refined Vidore Tasks
-
- The original Vidore datasets contain multiple text chunks per image so that text retrieval models can be evaluated on them.
- Those text chunks are extracted from the document pages using different tools such as [Unstructured](https://github.com/Unstructured-IO/unstructured), OCR models, and LLMs.
- For evaluating text retrieval models on our filtered versions of the Vidore datasets, you can use the datasets in the collection `https://huggingface.co/collections/jinaai/jina-vdr-vidoreocr-tasks-6852cfc55ccf837e7fecfa1b`.
-
- It is also possible to evaluate jina-embeddings-v4 and other vision retrieval models on them. This, however, takes more time and should lead to the same evaluation results as running the versions of the datasets in the Jina VDR collection.