Update README.md (#3)
- Update README.md (17637264b4dc0206a8518177b3d340f4c58f55da)
Co-authored-by: Alejandra Rico <alejandrar@users.noreply.huggingface.co>
README.md
CHANGED
|
@@ -44,7 +44,10 @@ Use Cases: Image summarization. Text-image analysis, Optical Character Recogniti
|
|
| 44 |
|
| 45 |
## Release Date:
|
| 46 |
|
| 47 |
-
-
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
## Model Architecture:
|
| 50 |
|
|
@@ -158,299 +161,52 @@ Additional processing for several datasets included rule-based QA generation (e.
|
|
| 158 |
** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>
|
| 159 |
|
| 160 |
# Public Datasets <br>
|
| 161 |
-
|
|
| 162 |
-
|
| 163 |
-
|
|
| 164 |
-
|
|
| 165 |
-
|
|
| 166 |
-
|
|
| 167 |
-
|
|
| 168 |
-
|
|
| 169 |
-
|
|
| 170 |
-
|
|
| 171 |
-
|
|
| 172 |
-
|
|
| 173 |
-
|
|
| 174 |
-
|
|
| 175 |
-
|
|
| 176 |
-
|
|
| 177 |
-
|
|
| 178 |
-
|
|
| 179 |
-
|
|
| 180 |
-
|
| 181 |
-
| Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
|
| 182 |
-
| CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
|
| 183 |
-
| NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
|
| 184 |
-
| CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
|
| 185 |
-
| ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
|
| 186 |
-
| WikiSQL | Image Reasoning | image, text | N/A | N/A |
|
| 187 |
-
| WikiTableQuestions | TextQA | text | N/A | N/A |
|
| 188 |
-
| RenderedText | OCR | image, text | N/A | N/A |
|
| 189 |
-
| FinQA | Text Reasoning | text | N/A | N/A |
|
| 190 |
-
| TAT-QA | Text Reasoning | text | N/A | N/A |
|
| 191 |
-
| Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
|
| 192 |
-
| WebSight | Image Classification | image, text | N/A | N/A |
|
| 193 |
-
| RAVEN | Image Reasoning | image, text | N/A | N/A |
|
| 194 |
-
| VizWiz | VQA | image, text | N/A | N/A |
|
| 195 |
-
| Inter-GPS | Image Reasoning | image, text | N/A | N/A |
|
| 196 |
-
| OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
|
| 197 |
-
| OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
|
| 198 |
-
| OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
|
| 199 |
-
| OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
|
| 200 |
-
| OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
|
| 201 |
-
| OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
|
| 202 |
-
| OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
|
| 203 |
-
| OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
|
| 204 |
-
| OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
|
| 205 |
-
| OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
|
| 206 |
-
| DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
|
| 207 |
-
| DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
|
| 208 |
-
| DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
|
| 209 |
-
| SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
|
| 210 |
-
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
|
| 211 |
-
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
|
| 212 |
-
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
|
| 213 |
-
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
|
| 214 |
-
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
|
| 215 |
-
| SynthTables | OCR | image, text | 4'887 | 0.38 GB |
|
| 216 |
-
| TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
|
| 217 |
-
| TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
|
| 218 |
-
| FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
|
| 219 |
-
| FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
|
| 220 |
-
| FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
|
| 221 |
-
| PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
|
| 222 |
-
| PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
|
| 223 |
-
| PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
|
| 224 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
|
| 225 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
|
| 226 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
|
| 227 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
|
| 228 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
|
| 229 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
|
| 230 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
|
| 231 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
|
| 232 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
|
| 233 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
|
| 234 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
|
| 235 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
|
| 236 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
|
| 237 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
|
| 238 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
|
| 239 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
|
| 240 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
|
| 241 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
|
| 242 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
|
| 243 |
-
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
|
| 244 |
-
| TextOCR | OCR | image, text | 21'727 | 5.83 GB |
|
| 245 |
-
| TextOCR | OCR | image, text | 21'138 | 2.83 GB |
|
| 246 |
-
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
|
| 247 |
-
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
|
| 248 |
-
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
|
| 249 |
-
| HierText | OCR | image, text | 8'278 | 2.60 GB |
|
| 250 |
-
| FUNSD | OCR | image, text | 149 | 0.01 GB |
|
| 251 |
-
| Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
|
| 252 |
-
| Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
|
| 253 |
-
| ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
|
| 254 |
-
| ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
|
| 255 |
-
| VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
|
| 256 |
-
| SLAKE | Safety | image, text | 9'835 | 0.85 GB |
|
| 257 |
-
| STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
|
| 258 |
-
| Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
|
| 259 |
-
| Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
|
| 260 |
-
| ai2d | VQA | image, text | 12'413 | 2.23 GB |
|
| 261 |
-
| ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
|
| 262 |
-
| ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
|
| 263 |
-
| ChartQA | VQA | image, text | 15'121 | 0.68 GB |
|
| 264 |
-
| ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
|
| 265 |
-
| ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
|
| 266 |
-
| ChartQA | VQA | image, text | 60'438 | 2.69 GB |
|
| 267 |
-
| Geo170K | VQA | image, text | 13'263 | 0.07 GB |
|
| 268 |
-
| InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
|
| 269 |
-
| DocVQA | VQA | image, text | 39'463 | 26.29 GB |
|
| 270 |
-
| DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
|
| 271 |
-
| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
|
| 272 |
-
| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
|
| 273 |
-
| TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
|
| 274 |
-
| PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
|
| 275 |
-
| OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
|
| 276 |
-
| ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
|
| 277 |
-
| WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
|
| 278 |
-
| EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
|
| 279 |
-
| TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
|
| 280 |
-
| TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
|
| 281 |
-
| SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
|
| 282 |
-
| pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
|
| 283 |
-
| pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
|
| 284 |
-
| pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
|
| 285 |
-
| pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
|
| 286 |
-
| TallyQA | VQA | image, text | 68'775 | 10.64 GB |
|
| 287 |
-
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
|
| 288 |
-
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
|
| 289 |
-
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
|
| 290 |
-
| TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
|
| 291 |
-
| VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
|
| 292 |
-
| Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
|
| 293 |
-
| ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
|
| 294 |
-
| RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
|
| 295 |
-
| OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
|
| 296 |
-
| Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
|
| 297 |
-
| LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
|
| 298 |
-
| GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
|
| 299 |
-
| MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
|
| 300 |
-
| MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
|
| 301 |
-
| MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
|
| 302 |
-
| PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
|
| 303 |
-
| Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
|
| 304 |
-
| Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
|
| 305 |
-
| VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
|
| 306 |
-
| ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
|
| 307 |
-
| wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
|
| 308 |
-
| Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
|
| 309 |
-
| Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
|
| 310 |
-
| CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
|
| 311 |
-
| SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
|
| 312 |
-
| UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
|
| 313 |
-
| CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
|
| 314 |
-
| MMTab | VQA | image, text | 232'746 | 59.23 GB |
|
| 315 |
-
| ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
|
| 316 |
-
| docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
|
| 317 |
-
| DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
|
| 318 |
-
| FigureQA | VQA | image, text | 100'000 | 2.37 GB |
|
| 319 |
-
| LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
|
| 320 |
-
| VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
|
| 321 |
-
| DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
|
| 322 |
-
| spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
|
| 323 |
-
| DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
|
| 324 |
-
| DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
|
| 325 |
-
| DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
|
| 326 |
-
| Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
|
| 327 |
-
| UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
|
| 328 |
-
| NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
|
| 329 |
-
| Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
|
| 330 |
-
| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
|
| 331 |
-
| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
|
| 332 |
-
| FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
|
| 333 |
-
| Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
|
| 334 |
-
| HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
|
| 335 |
-
| ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
|
| 336 |
-
| ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
|
| 337 |
-
| NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
|
| 338 |
-
| NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
|
| 339 |
-
| NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
|
| 340 |
-
| NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
|
| 341 |
-
| NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
|
| 342 |
-
| NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
|
| 343 |
-
| ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
|
| 344 |
-
| ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
|
| 345 |
-
| NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
|
| 346 |
-
| NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
|
| 347 |
-
| ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
|
| 348 |
-
| Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
|
| 349 |
-
| Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
|
| 350 |
-
| NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
|
| 351 |
-
| CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
|
| 352 |
-
| Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
|
| 353 |
-
| EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
|
| 354 |
-
| TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
|
| 355 |
-
| EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
|
| 356 |
-
| Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
|
| 357 |
-
| Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
|
| 358 |
-
| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
|
| 359 |
-
| ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
|
| 360 |
-
| EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
|
| 361 |
-
| FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
|
| 362 |
-
| HACS | VideoQA | video, text | 31'223 | 829.25 GB |
|
| 363 |
-
| HiREST | VideoQA | video, text | 822 | 42.50 GB |
|
| 364 |
-
| Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
|
| 365 |
-
| ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
|
| 366 |
-
| HiREST | VideoQA | video, text | 525 | 27.54 GB |
|
| 367 |
-
| YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
|
| 368 |
-
| DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
|
| 369 |
-
| EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
|
| 370 |
-
| MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
|
| 371 |
-
| QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
|
| 372 |
-
| YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
|
| 373 |
-
| EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
|
| 374 |
-
| Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
|
| 375 |
-
| EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
|
| 376 |
-
| CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
|
| 377 |
-
| CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
|
| 378 |
-
| EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
|
| 379 |
-
| EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
|
| 380 |
-
| HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
|
| 381 |
-
| HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
|
| 382 |
-
| TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
|
| 383 |
-
| TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
|
| 384 |
-
| Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
|
| 385 |
-
| Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
|
| 386 |
-
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
|
| 387 |
-
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
|
| 388 |
-
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
|
| 389 |
-
| Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
|
| 390 |
-
| Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
|
| 391 |
-
| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
|
| 392 |
-
| Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
<br>
|
| 396 |
-
|
| 397 |
# Private Datasets <br>
|
| 398 |
-
|
|
| 399 |
-
|
| 400 |
-
|
|
| 401 |
-
|
|
| 402 |
-
|
|
| 403 |
-
|
| 404 |
-
| Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
|
| 405 |
-
<br>
|
| 406 |
|
| 407 |
# Data Crawling and Scraping <br>
|
| 408 |
-
|
|
| 409 |
-
|
| 410 |
-
|
|
| 411 |
-
|
|
| 412 |
-
|
|
| 413 |
-
|
|
| 414 |
-
| Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB |
|
| 415 |
-
| Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
|
| 416 |
-
<br>
|
| 417 |
|
| 418 |
# User-Sourced Data (Collected by Provider including Prompts) <br>
|
| 419 |
<br>
|
| 420 |
|
| 421 |
# Self-Sourced Synthetic Data <br>
|
| 422 |
-
|
|
| 423 |
-
|
| 424 |
-
|
|
| 425 |
-
|
|
| 426 |
-
|
|
| 427 |
-
|
|
| 428 |
-
| Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
|
| 429 |
-
| Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
|
| 430 |
-
| Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
|
| 431 |
-
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
|
| 432 |
-
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
|
| 433 |
-
| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
|
| 434 |
-
| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
|
| 435 |
-
| Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
|
| 436 |
-
| Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
|
| 437 |
-
| Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
|
| 438 |
-
| Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
|
| 439 |
-
| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
|
| 440 |
-
| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
|
| 441 |
-
| Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
|
| 442 |
-
| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
|
| 443 |
-
| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
|
| 444 |
-
| Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
|
| 445 |
-
| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
|
| 446 |
-
| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
|
| 447 |
-
| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
|
| 448 |
-
| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
|
| 449 |
-
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
|
| 450 |
-
| Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
|
| 451 |
-
<br>
|
| 452 |
-
|
| 453 |
-
|
| 454 |
|
| 455 |
|
| 456 |
**Properties**<br>
|
|
|
|
| 44 |
|
| 45 |
## Release Date:
|
| 46 |
|
+- Build.Nvidia.com [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-BF16](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16)
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8)
+- Hugging Face [October 28th, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD)
| 51 |
|
| 52 |
## Model Architecture:
|
| 53 |
|
|
|
|
| 161 |
** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>
|
| 162 |
|
| 163 |
# Public Datasets <br>
|
+| Type | Data Type | Total Samples | Total Size (GB) |
+|------|-----------|---------------|------------------|
+| Function call | text | 8,000 | 0.02 |
+| Image Captioning | image, text | 1,422,102 | 1,051.04 |
+| Image Reasoning | image, text | 1,888,217 | 286.95 |
+| OCR | image, text | 9,830,570 | 5,317.60 |
+| Referring Expression Grounding | image, text | 14,694 | 2.39 |
+| Safety | image, text | 34,187 | 9.21 |
+| Safety | text | 57,223 | 0.52 |
+| Safety | video, text | 12,988 | 11.78 |
+| Text Instruction Tuning | text | 245,056 | 1.13 |
+| Text Reasoning | text | 225,408 | 4.55 |
+| VQA | image, text | 8,174,136 | 2,207.52 |
+| VQA | video, text | 40,000 | 46.05 |
+| Video Captioning | video, text | 3,289 | 6.31 |
+| Video Reasoning | video, text | 42,620 | 49.10 |
+| VideoQA | video, text | 1,371,923 | 17,641.79 |
+| Visual Instruction Tuning | image, text | 1,173,877 | 167.79 |
+| **TOTAL** | | **24,544,290** | **26,803.75** |
+
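The change above replaces per-dataset rows with per-type summary rows. The following is a minimal sketch of that kind of roll-up, assuming rows are grouped by (type, modalities) and that `N/A` entries are skipped; the diff does not say how the summary was actually produced, and the names `parse_samples`, `parse_size_gb`, and `summarize` are hypothetical. The sample rows are copied from the removed table.

```python
from collections import defaultdict

# A few per-dataset rows copied from the removed table:
# (dataset, type, modalities, samples, size)
rows = [
    ("Glaive & Xlam", "Function call", "text", "8'000", "0.02 GB"),
    ("CLEVRER", "VQA", "video, text", "40'000", "46.05 GB"),
    ("CLEVRER", "Video Reasoning", "video, text", "42'620", "49.10 GB"),
    ("WikiSQL", "Image Reasoning", "image, text", "N/A", "N/A"),
]


def parse_samples(text):
    """Parse counts written with apostrophe separators; None for N/A."""
    return None if text == "N/A" else int(text.replace("'", ""))


def parse_size_gb(text):
    """Parse sizes written as '<number> GB'; None for N/A."""
    return None if text == "N/A" else float(text.split()[0])


def summarize(rows):
    """Sum samples and size per (type, modalities) pair."""
    totals = defaultdict(lambda: [0, 0.0])
    for _name, kind, modalities, samples, size in rows:
        n, gb = parse_samples(samples), parse_size_gb(size)
        if n is None or gb is None:
            continue  # assumption: rows without figures are left out
        totals[(kind, modalities)][0] += n
        totals[(kind, modalities)][1] += gb
    return {key: (n, round(gb, 2)) for key, (n, gb) in totals.items()}


summary = summarize(rows)
for (kind, modalities), (n, gb) in sorted(summary.items()):
    print(f"| {kind} | {modalities} | {n:,} | {gb:,.2f} |")
```

For these three types, the printed rows agree with the corresponding rows of the new per-type table. Other types cannot be checked the same way from this diff alone, because some removed rows carry `N/A` or did not survive in the rendered view.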
| 184 |
# Private Datasets <br>
|
+| Type | Modalities | Total Samples | Total Size (GB) |
+|------|------------|---------------|------------------|
+| Image Reasoning | image, text | 17,729 | 15.41 |
+| Text Reasoning | text | 445,958 | 9.01 |
+| **TOTAL** | | **463,687** | **24.42** |
+
| 191 |
|
| 192 |
# Data Crawling and Scraping <br>
|
+| Type | Modalities | Total Samples | Total Size (GB) |
+|------|------------|---------------|------------------|
+| Image Captioning | image, text | 39,870 | 10.24 |
+| VQA | image, text | 40,348 | 3.94 |
+| VideoQA | video, text | 288,728 | 393.30 |
+| **TOTAL** | | **368,946** | **407.48** |
| 199 |
|
| 200 |
# User-Sourced Data (Collected by Provider including Prompts) <br>
|
| 201 |
<br>
|
| 202 |
|
| 203 |
# Self-Sourced Synthetic Data <br>
|
+| Type | Data Type | Total Samples | Total Size (GB) |
+|------|-----------|---------------|------------------|
+| Code | text | 1,165,591 | 54.15 |
+| OCR | image, text | 216,332 | 83.53 |
+| Text Reasoning | text | 12,727,857 | 295.80 |
+| **TOTAL** | | **14,109,780** | **433.48** |
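Each of the four per-type tables added in this diff ends with a **TOTAL** row. As a quick consistency check, the sketch below re-adds the per-type rows and compares them with the stated totals. The figures are copied from the added lines; the function name `check_totals` and the 0.005 GB tolerance (to absorb floating-point rounding) are assumptions made for illustration.

```python
# (samples, size in GB) per row, followed by the stated TOTAL row.
tables = {
    "Public Datasets": (
        [(8_000, 0.02), (1_422_102, 1051.04), (1_888_217, 286.95),
         (9_830_570, 5317.60), (14_694, 2.39), (34_187, 9.21),
         (57_223, 0.52), (12_988, 11.78), (245_056, 1.13),
         (225_408, 4.55), (8_174_136, 2207.52), (40_000, 46.05),
         (3_289, 6.31), (42_620, 49.10), (1_371_923, 17641.79),
         (1_173_877, 167.79)],
        (24_544_290, 26803.75),
    ),
    "Private Datasets": (
        [(17_729, 15.41), (445_958, 9.01)],
        (463_687, 24.42),
    ),
    "Data Crawling and Scraping": (
        [(39_870, 10.24), (40_348, 3.94), (288_728, 393.30)],
        (368_946, 407.48),
    ),
    "Self-Sourced Synthetic Data": (
        [(1_165_591, 54.15), (216_332, 83.53), (12_727_857, 295.80)],
        (14_109_780, 433.48),
    ),
}


def check_totals(tables, tolerance_gb=0.005):
    """Return {table name: True if the rows add up to the stated total}."""
    results = {}
    for name, (rows, (total_samples, total_gb)) in tables.items():
        samples = sum(n for n, _ in rows)
        size_gb = sum(gb for _, gb in rows)
        results[name] = (
            samples == total_samples
            and abs(size_gb - total_gb) < tolerance_gb
        )
    return results


results = check_totals(tables)
for name, ok in results.items():
    print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

With the figures as they appear in the diff, all four tables report `OK`.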
| 210 |
|
| 211 |
|
| 212 |
**Properties**<br>
|