# PDF

This notebook covers how to load PDF files into a document format that can be used downstream.

## Using PyPDF

Loading a PDF with PyPDF also tracks page numbers: each returned document stores its page number in `metadata["page"]`.

In [1]:
from langchain.document_loaders import PagedPDFSplitter

# Load the PDF and split it into documents that keep their page metadata
loader = PagedPDFSplitter("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

In [4]:
pages[0]

Document(page_content='LayoutParser : A Uni\x0ced Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\nfmelissadell,jacob carlson g@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model con\x0cgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\ne\x0borts to improve reusability and simplify deep learning (DL) model\ndevelo

An advantage of this approach is that retrieved documents carry their page numbers, so a search result can be traced back to its page in the PDF.

In [9]:
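from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

# Embed every page and index it (OpenAIEmbeddings requires an OpenAI API key)
faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
# Retrieve the two pages most similar to the query
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    # Print the page number from the metadata, then the page text
    print(str(doc.metadata["page"]) + ":", doc.page_content)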

9: 10 Z. Shen et al.
Fig. 4: Illustration of (a) the original historical Japanese document with layout
detection results and (b) a recreated version of the document image that achieves
much better character recognition recall. The reorganization algorithm rearranges
the tokens based on the their detected bounding boxes given a maximum allowed
height.
4LayoutParser Community Platform
Another focus of LayoutParser is promoting the reusability of layout detection
models and full digitization pipelines. Similar to many existing deep learning
libraries, LayoutParser comes with a community model hub for distributing
layout models. End-users can upload their self-trained models to the model hub,
and these models can be loaded into a similar interface as the currently available
LayoutParser pre-trained models. For example, the model trained on the News
Navigator dataset [17] has been incorporated in the model hub.
Beyond DL models, LayoutParser also promotes the sharing of entire doc-
ument di
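The page metadata can also be turned into a short, human-readable citation for each search result. The sketch below is illustrative only: `Document` is a stand-in dataclass rather than the LangChain class, and `cite` is a hypothetical helper. The output above pairs `metadata["page"]` value 9 with a page whose running header reads 10, which suggests the index is zero-based; the code treats that as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the loader's document type (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cite(doc: Document, width: int = 60) -> str:
    """Return a one-line citation: display page number plus a text preview."""
    # Assumption: metadata["page"] is a zero-based index, so add 1 for display.
    page = doc.metadata["page"] + 1
    # Collapse newlines and repeated spaces, then truncate the preview.
    preview = " ".join(doc.page_content.split())[:width]
    return f"p. {page}: {preview}"


# Sample data standing in for similarity_search results.
docs = [
    Document("10 Z. Shen et al.\nFig. 4: Illustration of ...", {"page": 9}),
    Document("LayoutParser : A Unified Toolkit", {"page": 0}),
]
citations = [cite(d) for d in docs]
for line in citations:
    print(line)
```

With real search results, the same helper could be applied to each item returned by `similarity_search`, provided the documents expose `page_content` and a `"page"` metadata key as they do above.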

## Using Unstructured

In [3]:
from langchain.document_loaders import UnstructuredPDFLoader

In [4]:
loader = UnstructuredPDFLoader("example_data/layout-parser-paper.pdf")

In [None]:
data = loader.load()

### Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode="elements"`.

In [None]:
loader = UnstructuredPDFLoader("example_data/layout-parser-paper.pdf", mode="elements")

In [None]:
data = loader.load()

In [None]:
data[0]
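To make the difference between the two modes concrete, here is a minimal sketch of the idea. It is not the Unstructured or LangChain implementation: `Document` is a stand-in dataclass, `to_documents` is a hypothetical helper, the chunks are made up, and joining with a blank line is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in document type (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_documents(elements, source, mode="single"):
    """Turn a list of text chunks into documents.

    mode="single" joins every chunk into one document;
    mode="elements" keeps one document per chunk.
    """
    if mode == "elements":
        return [Document(text, {"source": source}) for text in elements]
    if mode == "single":
        # Assumption: chunks are joined with a blank line between them.
        return [Document("\n\n".join(elements), {"source": source})]
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up chunks standing in for what a partitioner might emit.
chunks = ["LayoutParser", "Abstract.", "Recent advances in document image analysis"]
combined = to_documents(chunks, "example.pdf")
separate = to_documents(chunks, "example.pdf", mode="elements")
print(len(combined), len(separate))
```

Keeping elements separate is useful when downstream steps need finer-grained chunks, while the combined form is simpler when the whole file is treated as one unit.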

## Using PDFMiner

In [7]:
from langchain.document_loaders import PDFMinerLoader

In [8]:
loader = PDFMinerLoader("example_data/layout-parser-paper.pdf")

In [9]:
data = loader.load()