📄 PDF Support in the Hugging Face Dataset Viewer

Community Article Published June 25, 2025

PDFs are a common format for sharing unstructured content, including legal documents, research papers, digitized books, and scanned reports. Until now, working with PDF-based datasets on the Hub required downloading files and relying on external tools to inspect or process them.

The Hugging Face Dataset Viewer now supports native PDF rendering, allowing users to preview and interact with documents directly in the browser.

🔧 New Viewer Capabilities

  • Thumbnail previews: The viewer now generates and displays a thumbnail (cover page) for each PDF file.


  • Inline rendering: PDFs can be opened and browsed directly within the viewer, without requiring local downloads.


These capabilities improve transparency, reproducibility, and early-stage dataset inspection, particularly in document-heavy domains.

🐍 Programmatic PDF Processing with datasets and pdfplumber

Starting with datasets version 3.5.0, PDF content can be loaded as typed objects using the Pdf feature.

Each entry is decoded into a pdfplumber PDF object (pdfplumber.pdf.PDF), which supports:

  • Page-level navigation
  • Text extraction
  • Table detection
  • Embedded image access
  • Thumbnail rendering

Low-level operations are powered by pdfplumber, which handles PDF parsing internally.
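
To make the list above more concrete, here is a minimal sketch of how those operations map onto attributes of a pdfplumber page. The helper name summarize_pdf and its max_pages parameter are illustrative and not part of datasets or pdfplumber. The code only assumes that the object exposes pages, and that each page offers page_number, extract_text(), extract_tables(), and images, as pdfplumber pages do.

```python
from typing import Any, Dict, List


def summarize_pdf(pdf: Any, max_pages: int = 3) -> List[Dict[str, int]]:
    """Collect simple per-page statistics from a pdfplumber-style PDF object.

    `summarize_pdf` is a hypothetical helper written for illustration only.
    """
    summary = []
    for page in pdf.pages[:max_pages]:
        # extract_text() can return None or "" for image-only pages
        text = page.extract_text() or ""
        summary.append(
            {
                "page_number": page.page_number,         # 1-based in pdfplumber
                "n_chars": len(text),                    # text extraction
                "n_tables": len(page.extract_tables()),  # table detection
                "n_images": len(page.images),            # embedded images
            }
        )
    return summary
```

Called on a decoded dataset entry, this returns one small dictionary per inspected page. A page with zero characters but at least one image is often a scanned page, which can be a useful signal before running heavier processing.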

▶️ Example

from datasets import load_dataset

# Load a dataset with PDF files
dataset = load_dataset("GOAT-AI/generated-novels", split="train")

# Access the first PDF document (index the row first so only this file is decoded)
first_pdf = dataset[0]["pdf"]

# Check total number of pages
print(f"Total pages: {len(first_pdf.pages)}")

# Generate a thumbnail image
cover_image = first_pdf.pages[0].to_image()
# Optionally save: cover_image.save("cover.png")

# Extract text from the second page
if len(first_pdf.pages) > 1:
    page_text = first_pdf.pages[1].extract_text()
    print("Page 2 text:", page_text)
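
The same page-level calls extend to a whole split. The sketch below assumes the column is named pdf, as in the example above, and uses a hypothetical helper, iter_pdf_text, that walks rows one at a time and yields the text of the first few pages of each document. It relies only on each row being a dictionary whose pdf value exposes pages and extract_text().

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_pdf_text(
    rows: Iterable[Dict[str, Any]],
    column: str = "pdf",
    max_pages: int = 2,
) -> Iterator[Tuple[int, str]]:
    """Yield (row_index, text) pairs for the first `max_pages` pages of each PDF.

    `iter_pdf_text` is a hypothetical helper, not a datasets or pdfplumber API.
    """
    for index, row in enumerate(rows):
        pdf = row[column]
        page_texts = []
        for page in pdf.pages[:max_pages]:
            # Fall back to an empty string when a page has no extractable text
            page_texts.append(page.extract_text() or "")
        yield index, "\n".join(page_texts)
```

Passing the dataset object from the example above, as in iter_pdf_text(dataset), should work because iterating over a Dataset yields one dictionary per row. Keeping max_pages small makes a first pass over long documents cheaper.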

✅ Conclusion

Support for PDFs in both the Dataset Viewer and the datasets library streamlines exploration and programmatic processing of document-based datasets. This is especially useful in workflows involving legal documents, OCR pipelines, large-scale report mining, or research paper analysis.

For more information and tooling:
