Metadata-Version: 2.4
Name: natural-pdf
Version: 0.1.12
Summary: A more intuitive interface for working with PDFs
Author-email: Jonathan Soma <jonathan.soma@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jsoma/natural-pdf
Project-URL: Repository, https://github.com/jsoma/natural-pdf
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pdfplumber
Requires-Dist: pillow
Requires-Dist: colour
Requires-Dist: numpy
Requires-Dist: urllib3
Requires-Dist: tqdm
Requires-Dist: pydantic
Requires-Dist: jenkspy
Requires-Dist: pikepdf>=9.7.0
Provides-Extra: viewer
Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "viewer"
Provides-Extra: easyocr
Requires-Dist: easyocr; extra == "easyocr"
Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
Provides-Extra: paddle
Requires-Dist: paddlepaddle; extra == "paddle"
Requires-Dist: paddleocr; extra == "paddle"
Provides-Extra: layout-yolo
Requires-Dist: doclayout_yolo; extra == "layout-yolo"
Requires-Dist: natural-pdf[core-ml]; extra == "layout-yolo"
Provides-Extra: surya
Requires-Dist: surya-ocr; extra == "surya"
Requires-Dist: natural-pdf[core-ml]; extra == "surya"
Provides-Extra: doctr
Requires-Dist: python-doctr[torch]; extra == "doctr"
Requires-Dist: natural-pdf[core-ml]; extra == "doctr"
Provides-Extra: docling
Requires-Dist: docling; extra == "docling"
Requires-Dist: natural-pdf[core-ml]; extra == "docling"
Provides-Extra: llm
Requires-Dist: openai>=1.0; extra == "llm"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Provides-Extra: search
Requires-Dist: lancedb; extra == "search"
Requires-Dist: pyarrow; extra == "search"
Provides-Extra: favorites
Requires-Dist: natural-pdf[deskew]; extra == "favorites"
Requires-Dist: natural-pdf[llm]; extra == "favorites"
Requires-Dist: natural-pdf[surya]; extra == "favorites"
Requires-Dist: natural-pdf[easyocr]; extra == "favorites"
Requires-Dist: natural-pdf[layout_yolo]; extra == "favorites"
Requires-Dist: natural-pdf[ocr-export]; extra == "favorites"
Requires-Dist: natural-pdf[viewer]; extra == "favorites"
Requires-Dist: natural-pdf[search]; extra == "favorites"
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: nox; extra == "dev"
Requires-Dist: nox-uv; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: uv; extra == "dev"
Requires-Dist: pipdeptree; extra == "dev"
Requires-Dist: nbformat; extra == "dev"
Requires-Dist: jupytext; extra == "dev"
Requires-Dist: nbclient; extra == "dev"
Requires-Dist: ipykernel; extra == "dev"
Provides-Extra: deskew
Requires-Dist: deskew>=1.5; extra == "deskew"
Requires-Dist: img2pdf; extra == "deskew"
Provides-Extra: all
Requires-Dist: natural-pdf[viewer]; extra == "all"
Requires-Dist: natural-pdf[easyocr]; extra == "all"
Requires-Dist: natural-pdf[paddle]; extra == "all"
Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
Requires-Dist: natural-pdf[surya]; extra == "all"
Requires-Dist: natural-pdf[doctr]; extra == "all"
Requires-Dist: natural-pdf[ocr-export]; extra == "all"
Requires-Dist: natural-pdf[docling]; extra == "all"
Requires-Dist: natural-pdf[llm]; extra == "all"
Requires-Dist: natural-pdf[core-ml]; extra == "all"
Requires-Dist: natural-pdf[deskew]; extra == "all"
Requires-Dist: natural-pdf[test]; extra == "all"
Requires-Dist: natural-pdf[search]; extra == "all"
Provides-Extra: core-ml
Requires-Dist: torch; extra == "core-ml"
Requires-Dist: torchvision; extra == "core-ml"
Requires-Dist: transformers[sentencepiece]; extra == "core-ml"
Requires-Dist: huggingface_hub; extra == "core-ml"
Requires-Dist: sentence-transformers; extra == "core-ml"
Requires-Dist: numpy; extra == "core-ml"
Requires-Dist: timm; extra == "core-ml"
Provides-Extra: ocr-export
Requires-Dist: pikepdf; extra == "ocr-export"
Provides-Extra: export-extras
Requires-Dist: jupytext; extra == "export-extras"
Requires-Dist: nbformat; extra == "export-extras"
Dynamic: license-file

# Natural PDF

A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

- [Complete documentation here](https://jsoma.github.io/natural-pdf)
- [Live demos here](https://colab.research.google.com/github/jsoma/natural-pdf/)

<div style="max-width: 400px; margin: auto"><a href="sample-screen.png"><img src="sample-screen.png"></a></div>

## Installation

```bash
pip install natural-pdf
```

For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install one to two million different extras. If you just want the greatest hits:

```bash
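# deskewing, OCR (surya + easyocr), layout analysis (yolo), LLM, search, OCR export, interactive viewer
# (quotes keep shells like zsh from treating the brackets as a glob pattern)
pip install "natural-pdf[favorites]"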
```

See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
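
If you are not sure which extras made it into your environment, a small check like the one below can help. It is only a sketch: the mapping from extra names to importable module names is an assumption based on each project's usual import name, not something Natural PDF provides, so adjust it if a package imports under a different name.

```python
from importlib.util import find_spec

# Extra name -> module expected to be importable once that extra is installed.
# These module names are assumptions, not part of Natural PDF itself.
EXTRA_MODULES = {
    "viewer": "ipywidgets",
    "easyocr": "easyocr",
    "paddle": "paddleocr",
    "layout-yolo": "doclayout_yolo",
    "surya": "surya",
    "doctr": "doctr",
    "docling": "docling",
    "llm": "openai",
    "search": "lancedb",
    "deskew": "deskew",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra_name: bool} without importing the (often heavy) packages."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        print(f"{extra:12} {'installed' if installed else 'missing'}")
```

`find_spec` only looks the module up, so the check stays fast even when large packages such as torch-based OCR engines are installed.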

## Quick Start

```python
from natural_pdf import PDF

# Open a PDF
pdf = PDF('document.pdf')
page = pdf.pages[0]

# Extract all of the text on the page
page.extract_text()

# Find elements using CSS-like selectors
heading = page.find('text:contains("Summary"):bold')

# Extract content below the heading
content = heading.below().extract_text()

# Examine all the bold text on the page
page.find_all('text:bold').show()

# Exclude parts of the page from selectors/extractors
header = page.find('text:contains("CONFIDENTIAL")').above()
footer = page.find_all('line')[-1].below()
page.add_exclusion(header)
page.add_exclusion(footer)

# Extract clean text from the page ignoring exclusions
clean_text = page.extract_text()
```

And as a fun bonus, `page.viewer()` opens an interactive viewer for exploring the PDF in a notebook (the `viewer` extra installs the ipywidgets dependency for it).
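
The Quick Start chains `page.find(...)` straight into `.below()`, which works as long as the selector matches something. Here is a minimal defensive sketch, assuming `find()` returns `None` when nothing matches (check the documentation for the exact behavior). `text_below` is a hypothetical helper, not part of the library; it only uses the `find()`, `below()`, and `extract_text()` calls shown above.

```python
def text_below(page, selector):
    """Return the text below the first match for `selector`, or None if no match.

    `page` is expected to behave like a Natural PDF page: it only needs the
    find(), below() and extract_text() calls used in the Quick Start.
    """
    element = page.find(selector)
    if element is None:  # assumption: find() returns None when nothing matches
        return None
    return element.below().extract_text()
```

With a real document this would be called as `text_below(pdf.pages[0], 'text:contains("Summary"):bold')`.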

## Key Features

Natural PDF offers a range of features for working with PDFs:

*   **CSS-like Selectors:** Find elements using intuitive query strings (`page.find('text:bold')`).
*   **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
*   **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
*   **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
*   **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
*   **Document QA:** Ask natural language questions about your document's content.
*   **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning (the `search` extra installs LanceDB and PyArrow for this).
*   **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.

## Learn More

Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).

## Best Friends

Natural PDF sits on top of a *lot* of fantastic tools and models, some of which are:

- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [EasyOCR](https://www.jaided.ai/easyocr/)
- [PaddleOCR](https://paddlepaddle.github.io/PaddleOCR/latest/en/index.html)
- [Surya](https://github.com/VikParuchuri/surya)
- A specific [YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [deskew](https://github.com/sbrunner/deskew)
- [doctr](https://github.com/mindee/doctr)
- [docling](https://github.com/docling-project/docling)
- [Hugging Face](https://huggingface.co/models)
