Metadata-Version: 2.4
Name: natural-pdf
Version: 0.1.9
Summary: A more intuitive interface for working with PDFs
Author-email: Jonathan Soma <jonathan.soma@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jsoma/natural-pdf
Project-URL: Repository, https://github.com/jsoma/natural-pdf
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pdfplumber
Requires-Dist: Pillow
Requires-Dist: colour
Requires-Dist: numpy
Requires-Dist: urllib3
Requires-Dist: tqdm
Requires-Dist: pydantic
Provides-Extra: interactive
Requires-Dist: ipywidgets<9.0.0,>=7.0.0; extra == "interactive"
Provides-Extra: haystack
Requires-Dist: haystack-ai; extra == "haystack"
Requires-Dist: lancedb-haystack; extra == "haystack"
Requires-Dist: lancedb; extra == "haystack"
Requires-Dist: sentence-transformers; extra == "haystack"
Requires-Dist: natural-pdf[core-ml]; extra == "haystack"
Provides-Extra: easyocr
Requires-Dist: easyocr; extra == "easyocr"
Requires-Dist: natural-pdf[core-ml]; extra == "easyocr"
Provides-Extra: paddle
Requires-Dist: paddlepaddle; extra == "paddle"
Requires-Dist: paddleocr; extra == "paddle"
Provides-Extra: layout-yolo
Requires-Dist: doclayout_yolo; extra == "layout-yolo"
Requires-Dist: natural-pdf[core-ml]; extra == "layout-yolo"
Provides-Extra: surya
Requires-Dist: surya-ocr; extra == "surya"
Requires-Dist: natural-pdf[core-ml]; extra == "surya"
Provides-Extra: doctr
Requires-Dist: python-doctr[torch]; extra == "doctr"
Requires-Dist: natural-pdf[core-ml]; extra == "doctr"
Provides-Extra: qa
Requires-Dist: natural-pdf[core-ml]; extra == "qa"
Provides-Extra: docling
Requires-Dist: docling; extra == "docling"
Requires-Dist: natural-pdf[core-ml]; extra == "docling"
Provides-Extra: llm
Requires-Dist: openai>=1.0; extra == "llm"
Provides-Extra: classification
Requires-Dist: sentence-transformers; extra == "classification"
Requires-Dist: timm; extra == "classification"
Requires-Dist: natural-pdf[core-ml]; extra == "classification"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: nox; extra == "dev"
Requires-Dist: nox-uv; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: uv; extra == "dev"
Requires-Dist: pipdeptree; extra == "dev"
Requires-Dist: nbformat; extra == "dev"
Requires-Dist: jupytext; extra == "dev"
Requires-Dist: nbclient; extra == "dev"
Provides-Extra: deskew
Requires-Dist: deskew>=1.5; extra == "deskew"
Requires-Dist: img2pdf; extra == "deskew"
Provides-Extra: all
Requires-Dist: natural-pdf[interactive]; extra == "all"
Requires-Dist: natural-pdf[haystack]; extra == "all"
Requires-Dist: natural-pdf[easyocr]; extra == "all"
Requires-Dist: natural-pdf[paddle]; extra == "all"
Requires-Dist: natural-pdf[layout_yolo]; extra == "all"
Requires-Dist: natural-pdf[surya]; extra == "all"
Requires-Dist: natural-pdf[doctr]; extra == "all"
Requires-Dist: natural-pdf[qa]; extra == "all"
Requires-Dist: natural-pdf[ocr-export]; extra == "all"
Requires-Dist: natural-pdf[docling]; extra == "all"
Requires-Dist: natural-pdf[llm]; extra == "all"
Requires-Dist: natural-pdf[classification]; extra == "all"
Requires-Dist: natural-pdf[deskew]; extra == "all"
Requires-Dist: natural-pdf[test]; extra == "all"
Provides-Extra: core-ml
Requires-Dist: torch; extra == "core-ml"
Requires-Dist: torchvision; extra == "core-ml"
Requires-Dist: transformers[sentencepiece]; extra == "core-ml"
Requires-Dist: huggingface_hub; extra == "core-ml"
Provides-Extra: ocr-export
Requires-Dist: ocrmypdf; extra == "ocr-export"
Requires-Dist: pikepdf; extra == "ocr-export"
Provides-Extra: export-extras
Requires-Dist: jupytext; extra == "export-extras"
Requires-Dist: nbformat; extra == "export-extras"
Dynamic: license-file

# Natural PDF

A friendly library for working with PDFs, built on top of [pdfplumber](https://github.com/jsvine/pdfplumber).

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

- [Complete documentation here](https://jsoma.github.io/natural-pdf)
- [Live demos here](https://colab.research.google.com/github/jsoma/natural-pdf/)

<div style="max-width: 400px; margin: auto"><a href="sample-screen.png"><img src="sample-screen.png"></a></div>

## Installation

```bash
pip install natural-pdf
```

For optional features like specific OCR engines, layout analysis models, or the interactive Jupyter widget, you can install extras:

```bash
# Example: Install with a specific OCR engine (EasyOCR, Surya, or PaddleOCR)
# The quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "natural-pdf[easyocr]"
pip install "natural-pdf[surya]"
pip install "natural-pdf[paddle]"

# Example: Install support for features using Large Language Models (e.g., via OpenAI-compatible APIs)
pip install "natural-pdf[llm]"
# (May require setting API key environment variables, e.g., GOOGLE_API_KEY for Gemini)

# Example: Install with interactive viewer support
pip install "natural-pdf[interactive]"

# Example: Install with semantic search support (Haystack)
pip install "natural-pdf[haystack]"

# Install everything
pip install "natural-pdf[all]"
```

See the [installation guide](https://jsoma.github.io/natural-pdf/installation/) for more details on extras.
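
If you are not sure which extras ended up in your environment, a small check like the one below can help. It is only a sketch: the mapping from extra name to importable module name is an assumption based on the packages each extra pulls in, not something Natural PDF provides, and `installed_extras` is a hypothetical helper name.

```python
import importlib.util

# Assumed mapping: natural-pdf extra name -> top-level module it installs.
# Adjust this if a package exposes a different import name.
EXTRA_MODULES = {
    "easyocr": "easyocr",
    "surya": "surya",
    "paddle": "paddleocr",
    "llm": "openai",
    "interactive": "ipywidgets",
    "haystack": "haystack",
}


def installed_extras(extra_modules=EXTRA_MODULES):
    """Return the sorted names of extras whose module can be located."""
    return sorted(
        name
        for name, module in extra_modules.items()
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module) is not None
    )


if __name__ == "__main__":
    print("Installed extras:", installed_extras())
```

This only looks the modules up without importing them, so it stays fast even when heavy packages such as OCR engines are present.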

## Quick Start

```python
from natural_pdf import PDF

# Open a PDF
pdf = PDF('document.pdf')
page = pdf.pages[0]

# Find elements using CSS-like selectors
heading = page.find('text:contains("Summary"):bold')

# Extract content below the heading
content = heading.below().extract_text()
print("Content below Summary:", content[:100] + "...")

# Exclude headers/footers automatically (example)
# You might define these based on common text or position
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Extract clean text from the page
clean_text = page.extract_text()
print("\nClean page text:", clean_text[:200] + "...")

# Highlight the heading and view the page
heading.highlight(color='red')
page.to_image()
```

As a bonus, `page.viewer()` opens an interactive viewer for exploring the PDF (install the `interactive` extra for viewer support).
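
The Quick Start code assumes the "Summary" heading exists on the page. For documents where it might not, a guarded variant could look like the sketch below. It uses only the `find`, `below`, and `extract_text` calls shown above; `text_below` is a hypothetical helper, and the sketch assumes `find` returns `None` when nothing matches.

```python
def text_below(page, selector, limit=100):
    """Return up to `limit` characters of text below the first match, or None."""
    match = page.find(selector)
    if match is None:  # assumption: find() returns None when nothing matches
        return None
    # Same chain as the Quick Start: take the region below, then extract its text
    return match.below().extract_text()[:limit]


# Usage with a real document (requires natural-pdf and a PDF file):
# from natural_pdf import PDF
# page = PDF('document.pdf').pages[0]
# print(text_below(page, 'text:contains("Summary"):bold'))
```

Returning `None` instead of raising lets a script skip pages that lack the heading and carry on with the rest of the document.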

## Key Features

Natural PDF offers a range of features for working with PDFs:

*   **CSS-like Selectors:** Find elements using intuitive query strings (`page.find('text:bold')`).
*   **Spatial Navigation:** Select content relative to other elements (`heading.below()`, `element.select_until(...)`).
*   **Text & Table Extraction:** Get clean text or structured table data, automatically handling exclusions.
*   **OCR Integration:** Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
*   **Layout Analysis:** Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
*   **Document QA:** Ask natural language questions about your document's content.
*   **Semantic Search:** Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
*   **Visual Debugging:** Highlight elements and use an interactive viewer or save images to understand your selections.

## Learn More

Dive deeper into the features and explore advanced usage in the [**Complete Documentation**](https://jsoma.github.io/natural-pdf).
