Details for dataset generation

#11 by deepcopy
To train our new Vision-Language Model (VLM) for precise optical character recognition (OCR), we curated a dataset comprising over 250,000 pages. The dataset includes the following document types: research papers, financial documents, legal documents, healthcare documents, tax forms, receipts, and invoices. Additionally, the collection features documents containing images, plots, equations, signatures, watermarks, checkboxes, and complex tables.

We used both synthetic and manually annotated datasets: the model was first trained on the synthetic data and then fine-tuned on the manually annotated data.
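For context, a minimal sketch of what such a two-stage schedule could look like is below. The toy model, random tensor datasets, and hyperparameters are placeholders chosen purely for illustration; the post does not describe the actual architecture, data pipeline, or training code.

```python
# Sketch of the two-stage recipe described above: stage 1 trains on synthetic
# pages, stage 2 fine-tunes at a lower learning rate on manually annotated pages.
# Everything below (toy model, random tensors, hyperparameters) is a placeholder.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def make_dataset(n_pages: int) -> TensorDataset:
    """Stand-in for a real OCR dataset of (page image, token label) pairs."""
    images = torch.randn(n_pages, 3 * 32 * 32)   # flattened page crops
    labels = torch.randint(0, 100, (n_pages,))   # token ids
    return TensorDataset(images, labels)

def run_stage(model: nn.Module, dataset: TensorDataset, epochs: int, lr: float) -> None:
    """Generic supervised loop reused for both stages; only the data and LR change."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            optim.zero_grad()
            loss.backward()
            optim.step()

model = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 100))

run_stage(model, make_dataset(1024), epochs=3, lr=1e-4)  # stage 1: synthetic pages
run_stage(model, make_dataset(256), epochs=1, lr=1e-5)   # stage 2: manual annotations
```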

Can you share more about the data sources? How did you collect the documents and how did you create the ground truth?
Are you able to share the datasets as well (either manual annotations or synthetic ones)?
