zhiyucheng and alejandrar committed
Commit 3294490 · verified · 1 parent: 41fb45e

Update README.md (#3)

- Update README.md (17637264b4dc0206a8518177b3d340f4c58f55da)

Co-authored-by: Alejandra Rico <alejandrar@users.noreply.huggingface.co>

Files changed (1):
  1. README.md +42 -286
README.md CHANGED
@@ -44,7 +44,10 @@ Use Cases: Image summarization. Text-image analysis, Optical Character Recogniti

 ## Release Date:

- - Hugging Face [October 28th, 2025]

 ## Model Architecture:

@@ -158,299 +161,52 @@ Additional processing for several datasets included rule-based QA generation (e.

 ** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>

 # Public Datasets <br>
- | Dataset Name | Type | Modalities | Number of Samples | Size |
- |--------------|------|------------|-------------------|------|
- | Captioning on Open Images (subset, relabeled) | VQA | image, text | 1'278'221 | 378.34 GB |
- | Localized Narratives (subset, relabeled) | VQA | image, text | 503'275 | 147.67 GB |
- | TextCaps (subset) | Image Captioning | image, text | 21'953 | 5.76 GB |
- | TextCaps (subset) | Image Captioning | image, text | 109'765 | 28.81 GB |
- | TextVQA (subset) | Image Captioning | image, text | 34'602 | 9.08 GB |
- | RefCoco | Referring Expression Grounding | image, text | 14'694 | 2.39 GB |
- | VQAv2 | VQA | image, text | 28'555 | 4.41 GB |
- | AOKVQA | VQA | image, text | 20'832 | 3.39 GB |
- | GQA | VQA | image, text | 21'433 | 2.94 GB |
- | AOKVQA | VQA | image, text | 16'131 | 2.62 GB |
- | synthdog-en | OCR | image, text | 29'672 | 2.31 GB |
- | WIT | Image Captioning | image, text | 538'916 | 745.24 GB |
- | CLEVR | Image Reasoning | image, text | 70'000 | 12.57 GB |
- | CLEVR-Math | Image Reasoning | image, text | 70'000 | 12.47 GB |
- | OpenAssistant (oasst1, oasst2) | Text Instruction Tuning | text | 47'118 | 0.09 GB |
- | VATEX | Video Captioning | video, text | 2'880 | 5.50 GB |
- | YouCook2 | Video Captioning | video, text | 36 | 0.17 GB |
- | VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB |
- | Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
- | CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
- | NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
- | CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
- | ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
- | WikiSQL | Image Reasoning | image, text | N/A | N/A |
- | WikiTableQuestions | TextQA | text | N/A | N/A |
- | RenderedText | OCR | image, text | N/A | N/A |
- | FinQA | Text Reasoning | text | N/A | N/A |
- | TAT-QA | Text Reasoning | text | N/A | N/A |
- | Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
- | WebSight | Image Classification | image, text | N/A | N/A |
- | RAVEN | Image Reasoning | image, text | N/A | N/A |
- | VizWiz | VQA | image, text | N/A | N/A |
- | Inter-GPS | Image Reasoning | image, text | N/A | N/A |
- | OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
- | OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
- | OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
- | OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
- | OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
- | OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
- | OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
- | OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
- | DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
- | DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
- | DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
- | SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
- | OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
- | SynthTables | OCR | image, text | 4'887 | 0.38 GB |
- | TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
- | TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
- | FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
- | FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
- | FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
- | PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
- | PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
- | PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
- | OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
- | TextOCR | OCR | image, text | 21'727 | 5.83 GB |
- | TextOCR | OCR | image, text | 21'138 | 2.83 GB |
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
- | Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
- | HierText | OCR | image, text | 8'278 | 2.60 GB |
- | FUNSD | OCR | image, text | 149 | 0.01 GB |
- | Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
- | Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
- | ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
- | ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
- | VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
- | SLAKE | Safety | image, text | 9'835 | 0.85 GB |
- | STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
- | Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
- | Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
- | ai2d | VQA | image, text | 12'413 | 2.23 GB |
- | ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
- | ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
- | ChartQA | VQA | image, text | 15'121 | 0.68 GB |
- | ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
- | ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
- | ChartQA | VQA | image, text | 60'438 | 2.69 GB |
- | Geo170K | VQA | image, text | 13'263 | 0.07 GB |
- | InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
- | DocVQA | VQA | image, text | 39'463 | 26.29 GB |
- | DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
- | ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
- | TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
- | PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
- | OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
- | ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
- | WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
- | EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
- | TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
- | TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
- | SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
- | pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
- | pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
- | pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
- | pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
- | TallyQA | VQA | image, text | 68'775 | 10.64 GB |
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
- | Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
- | TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
- | VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
- | Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
- | ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
- | RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
- | OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
- | Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
- | LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
- | GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
- | MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
- | MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
- | MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
- | PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
- | Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
- | Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
- | VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
- | ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
- | wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
- | Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
- | Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
- | CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
- | SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
- | UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
- | CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
- | MMTab | VQA | image, text | 232'746 | 59.23 GB |
- | ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
- | docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
- | DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
- | FigureQA | VQA | image, text | 100'000 | 2.37 GB |
- | LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
- | VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
- | DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
- | spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
- | DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
- | DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
- | DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
- | Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
- | UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
- | NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
- | Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
- | OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
- | FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
- | Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
- | HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
- | ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
- | ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
- | NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
- | NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
- | NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
- | ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
- | ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
- | NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
- | NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
- | ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
- | Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
- | Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
- | NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
- | CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
- | Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
- | EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
- | TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
- | EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
- | Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
- | Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
- | ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
- | EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
- | FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
- | HACS | VideoQA | video, text | 31'223 | 829.25 GB |
- | HiREST | VideoQA | video, text | 822 | 42.50 GB |
- | Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
- | ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
- | HiREST | VideoQA | video, text | 525 | 27.54 GB |
- | YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
- | DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
- | EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
- | MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
- | QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
- | YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
- | EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
- | Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
- | EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
- | CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
- | CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
- | EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
- | EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
- | HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
- | HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
- | TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
- | TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
- | Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
- | Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
- | Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
- | Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
- | Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
- | Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
-
-
- <br>
-
 # Private Datasets <br>
- | Dataset Name | Type | Modalities | Number of Samples | Size |
- |--------------|------|------------|-------------------|------|
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
- | Internal safety alignment text dataset | Safety | Text | N/A | N/A |
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 445'958 | 9.01 GB |
- | Internal QA dataset on invoices | Image Reasoning | image, text | 6'471 | 5.22 GB |
- | Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
- <br>

 # Data Crawling and Scraping <br>
- | Dataset Name | Type | Modalities | Number of Samples | Size |
- |--------------|------|------------|-------------------|------|
- | Internal video dataset | VideoQA | video, text | 274'472 | 348.84 GB |
- | Internal video dataset | VideoQA | video, text | 14'256 | 44.46 GB |
- | Internal VQA and captioning dataset | Image Captioning | image, text | 14'872 | 3.27 GB |
- | Internal VQA dataset | VQA | image, text | 20'250 | 1.87 GB |
- | Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB |
- | Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
- <br>

 # User-Sourced Data (Collected by Provider including Prompts) <br>
 <br>

 # Self-Sourced Synthetic Data <br>
- | Dataset Name | Type | Modalities | Number of Samples | Size |
- |--------------|------|------------|-------------------|------|
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 5.76 GB |
- | Random ASCII characters for OCR | OCR | image, text | 14'533 | 9.26 GB |
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 15.00 GB |
- | Random Chinese characters for OCR | OCR | image, text | 29'108 | 24.11 GB |
- | Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
- | Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
- | Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
- | Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
- | Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
- | Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
- | Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
- | Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
- | Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
- | Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
- | Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
- | Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
- | Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
- | Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
- | Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
- | Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
- <br>
-
-

  **Properties**<br>
 
 ## Release Date:

+ - Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16)
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
+ - Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)

 ## Model Architecture:

 ** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>

 # Public Datasets <br>
+ | Type | Data Type | Total Samples | Total Size (GB) |
+ |------|-----------|---------------|------------------|
+ | Function call | text | 8,000 | 0.02 |
+ | Image Captioning | image, text | 1,422,102 | 1,051.04 |
+ | Image Reasoning | image, text | 1,888,217 | 286.95 |
+ | OCR | image, text | 9,830,570 | 5,317.60 |
+ | Referring Expression Grounding | image, text | 14,694 | 2.39 |
+ | Safety | image, text | 34,187 | 9.21 |
+ | Safety | text | 57,223 | 0.52 |
+ | Safety | video, text | 12,988 | 11.78 |
+ | Text Instruction Tuning | text | 245,056 | 1.13 |
+ | Text Reasoning | text | 225,408 | 4.55 |
+ | VQA | image, text | 8,174,136 | 2,207.52 |
+ | VQA | video, text | 40,000 | 46.05 |
+ | Video Captioning | video, text | 3,289 | 6.31 |
+ | Video Reasoning | video, text | 42,620 | 49.10 |
+ | VideoQA | video, text | 1,371,923 | 17,641.79 |
+ | Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
+ | **TOTAL** | | **24,544,290** | **26,803.75** |
+
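The summary tables on the new side of this diff are roll-ups of the per-dataset rows they replace: rows are grouped by type and modality, and the sample counts and sizes are summed. A minimal sketch of that aggregation (not the actual build script; the three input rows are copied from the removed per-dataset table above):

```python
from collections import defaultdict

# Per-dataset rows in the shape of the old README table:
# (dataset name, type, modalities, samples, size in GB).
rows = [
    ("RefCoco", "Referring Expression Grounding", "image, text", 14_694, 2.39),
    ("VQAv2", "VQA", "image, text", 28_555, 4.41),
    ("GQA", "VQA", "image, text", 21_433, 2.94),
]

# Group by (type, modalities) and sum samples and sizes,
# mirroring the "Total Samples" / "Total Size (GB)" columns.
totals = defaultdict(lambda: [0, 0.0])
for _name, dtype, mods, samples, size_gb in rows:
    totals[(dtype, mods)][0] += samples
    totals[(dtype, mods)][1] += size_gb

# Emit summary rows in the markdown-table shape used above.
for (dtype, mods), (n, gb) in sorted(totals.items()):
    print(f"| {dtype} | {mods} | {n:,} | {gb:.2f} |")
```

With these three rows, the two VQA entries collapse into a single summary row with 49,988 samples, which is exactly the kind of consolidation the diff performs at full scale.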
 # Private Datasets <br>
+ | Type | Modalities | Total Samples | Total Size (GB) |
+ |------|------------|---------------|------------------|
+ | Image Reasoning | image, text | 17,729 | 15.41 |
+ | Text Reasoning | text | 445,958 | 9.01 |
+ | **TOTAL** | | **463,687** | **24.42** |
+
  # Data Crawling and Scraping <br>
+ | Type | Modalities | Total Samples | Total Size (GB) |
+ |------|------------|---------------|------------------|
+ | Image Captioning | image, text | 39,870 | 10.24 |
+ | VQA | image, text | 40,348 | 3.94 |
+ | VideoQA | video, text | 288,728 | 393.30 |
+ | **TOTAL** | | **368,946** | **407.48** |

 # User-Sourced Data (Collected by Provider including Prompts) <br>
 <br>

  # Self-Sourced Synthetic Data <br>
+ | Type | Data Type | Total Samples | Total Size (GB) |
+ |------|-----------|---------------|------------------|
+ | Code | text | 1,165,591 | 54.15 |
+ | OCR | image, text | 216,332 | 83.53 |
+ | Text Reasoning | text | 12,727,857 | 295.80 |
+ | **TOTAL** | | **14,109,780** | **433.48** |
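Each **TOTAL** row in the new tables should equal the sums of the category rows above it. A quick sanity check of that arithmetic, using the figures from the Self-Sourced Synthetic Data table:

```python
# Category rows from the Self-Sourced Synthetic Data summary table:
# (type, total samples, total size in GB).
categories = [
    ("Code", 1_165_591, 54.15),
    ("OCR", 216_332, 83.53),
    ("Text Reasoning", 12_727_857, 295.80),
]

total_samples = sum(n for _, n, _ in categories)
total_gb = round(sum(gb for _, _, gb in categories), 2)

# Both figures match the table's TOTAL row.
assert total_samples == 14_109_780
assert total_gb == 433.48
```

The same check passes for the Private Datasets and Data Crawling tables (463,687 samples / 24.42 GB and 368,946 samples / 407.48 GB respectively).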
  **Properties**<br>