Metadata-Version: 2.2
Name: kreuzberg
Version: 1.4.0
Summary: A text extraction library supporting PDFs, images, office documents and more
Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
License: MIT
Project-URL: homepage, https://github.com/Goldziher/kreuzberg
Keywords: document-processing,docx,image-to-text,latex,markdown,ocr,odt,office-documents,pandoc,pdf,pdf-extraction,rag,text-extraction,text-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anyio>=4.8.0
Requires-Dist: charset-normalizer>=3.4.1
Requires-Dist: html-to-markdown>=1.2.0
Requires-Dist: pypandoc>=1.15
Requires-Dist: pypdfium2>=4.30.1
Requires-Dist: python-pptx>=1.0.2

# Kreuzberg

Kreuzberg is a library for simplified text extraction from PDFs, images, office documents and more. It's meant to offer
simple, hassle-free text extraction.

Why?

I am building, like many do now, a RAG-focused service (check out https://grantflow.ai). I have text extraction needs.
There are quite a lot of commercial options out there, and several open-source + paid options.
But I wanted something simple, which does not require expensive round-trips to an external API.
Furthermore, I wanted something that is easy to run locally, isn't very heavy, and doesn't require a GPU.

Hence, this library.

## Features

- Extract text from PDFs, images, office documents and more (see supported formats below)
- Use modern Python with async (via `anyio`) and proper type hints
- Extensive error handling for easy debugging

## Installation

1. Begin by installing the Python package:

   ```shell
   pip install kreuzberg
   ```

2. Install the system dependencies:

- [pandoc](https://pandoc.org/installing.html) (for non-PDF text extraction; GPL v2.0 licensed, but used via CLI only)
- [tesseract-ocr](https://tesseract-ocr.github.io/) (for image/PDF OCR, Apache License)
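
Before running extractions, it can help to confirm that both tools are discoverable on `PATH`. The following is a minimal sketch using only the standard library. The helper name `missing_system_dependencies` is hypothetical and not part of Kreuzberg, and the sketch assumes the executables are named `pandoc` and `tesseract`:

```python
import shutil
from typing import Iterable, List


def missing_system_dependencies(names: Iterable[str] = ("pandoc", "tesseract")) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_system_dependencies()
    if missing:
        print(f"Missing system dependencies: {', '.join(missing)}")
    else:
        print("pandoc and tesseract are both available")
```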

## Dependencies and Philosophy

This library is built to be minimalist and simple. It also aims to utilize OSS tools for the job. It's fundamentally a
high-order async abstraction on top of other tools; think of it as the library you would bake into your code base, but
polished and well maintained.

### Dependencies

- PDFs are processed using pypdfium2 for searchable PDFs + Tesseract OCR for scanned documents
- Images are processed using Tesseract OCR
- Office documents and other formats are processed using Pandoc
- PPTX files are converted using python-pptx
- HTML files are converted using html-to-markdown
- Plain text files are read directly with appropriate encoding detection
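
The list above can be pictured as a lookup from file extension to tool. The table below only illustrates that list and is not Kreuzberg's actual dispatch code; `BACKENDS` and `backend_for` are hypothetical names, and only a handful of extensions are shown:

```python
from pathlib import Path

# Illustrative only: file extension -> tool named in the list above.
BACKENDS = {
    ".pdf": "pypdfium2 (Tesseract OCR for scanned documents)",
    ".png": "Tesseract OCR",
    ".jpg": "Tesseract OCR",
    ".docx": "Pandoc",
    ".odt": "Pandoc",
    ".pptx": "python-pptx",
    ".html": "html-to-markdown",
    ".txt": "direct read with encoding detection",
}


def backend_for(path: str) -> str:
    """Return the tool listed for `path`, based on its lowercased extension."""
    suffix = Path(path).suffix.lower()
    try:
        return BACKENDS[suffix]
    except KeyError:
        raise ValueError(f"No backend listed for {suffix!r}") from None
```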

### Roadmap

V1:

- [x] - html file text extraction
- [ ] - better PDF table extraction
- [ ] - TBD

V2:

- [ ] - extra install groups (to make dependencies optional)
- [ ] - metadata extraction (possible breaking change)
- [ ] - TBD

### Feature Requests

Feel free to open a discussion or an issue on GitHub if you have any feature requests.

### Contribution

Contributions are welcome! Read the guidelines below.

## Supported File Types

Kreuzberg supports a wide range of file formats:

### Document Formats

- PDF (`.pdf`) - both searchable and scanned documents
- Word Documents (`.docx`, `.doc`)
- PowerPoint Presentations (`.pptx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)

### Image Formats

- JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- WebP (`.webp`)
- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
- Portable Anymap (`.pnm`)
- Portable Bitmap (`.pbm`)
- Portable Graymap (`.pgm`)
- Portable Pixmap (`.ppm`)

### Text and Markup Formats

- HTML (`.html`, `.htm`)
- Plain Text (`.txt`)
- Markdown (`.md`)
- reStructuredText (`.rst`)
- LaTeX (`.tex`)

### Data Formats

- Comma-Separated Values (`.csv`)
- Tab-Separated Values (`.tsv`)

## Usage

Kreuzberg exports two async functions:

- Extract text from a file (string path or `pathlib.Path`) using `extract_file()`
- Extract text from a byte-string using `extract_bytes()`

### Extract from File

```python
from pathlib import Path
from kreuzberg import extract_file


# Extract text from a PDF file
async def extract_pdf():
    result = await extract_file("document.pdf")
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")


# Extract text from an image
async def extract_image():
    result = await extract_file("scan.png")
    print(f"Extracted text: {result.content}")


# or pass a pathlib.Path instead of a string

async def extract_pdf_from_path():
    result = await extract_file(Path("document.pdf"))
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")
```
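
The functions above are coroutines, so they need an event loop to run. Below is a minimal sketch of driving several extractions concurrently from a synchronous entry point with `asyncio`. `FakeResult` and `fake_extract_file` are stand-ins invented here so the example runs without Kreuzberg or its system dependencies installed; in real code you would pass `kreuzberg.extract_file` as the `extract` argument instead.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, NamedTuple


class FakeResult(NamedTuple):
    """Stand-in with the same two fields as ExtractionResult."""

    content: str
    mime_type: str


async def fake_extract_file(path: str) -> FakeResult:
    # Hypothetical stand-in for kreuzberg.extract_file.
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FakeResult(content=f"text of {path}", mime_type="text/plain")


async def extract_many(
    paths: List[str],
    extract: Callable[[str], Awaitable[FakeResult]] = fake_extract_file,
) -> Dict[str, FakeResult]:
    """Run one extraction per path concurrently; results keep input order."""
    results = await asyncio.gather(*(extract(path) for path in paths))
    return dict(zip(paths, results))


if __name__ == "__main__":
    extracted = asyncio.run(extract_many(["document.pdf", "scan.png"]))
    for path, result in extracted.items():
        print(path, result.mime_type, len(result.content))
```

Since Kreuzberg is built on `anyio`, `anyio.run` should work as an entry point as well.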

### Extract from Bytes

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf")
    return result.content


# Extract text from image bytes
async def process_uploaded_image(image_content: bytes):
    result = await extract_bytes(image_content, mime_type="image/jpeg")
    return result.content
```
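
`extract_bytes()` takes the mime type explicitly. When the bytes arrive with a filename (for example, an upload), the standard library's `mimetypes` module can supply a guess. This is only a sketch: `guess_mime_type` is a hypothetical helper, and whether a guessed type is one Kreuzberg accepts should be checked against the supported formats.

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Guess a mime type from the file extension; None if it is unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


if __name__ == "__main__":
    for name in ("report.pdf", "scan.png", "photo.jpg"):
        print(name, guess_mime_type(name))
```

The guess relies on the extension alone, so a mislabeled file will still produce the wrong mime type.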

### Forcing OCR

When extracting from a PDF file or PDF bytes, you might want to force OCR, for example when the PDF contains images with text that should be extracted.
You can do this by passing `force_ocr=True`:

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes and force OCR
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
    return result.content
```
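
One possible pattern is to try normal extraction first and retry with `force_ocr=True` only when little text comes back. The sketch below illustrates that idea and is not library behavior: `needs_ocr`, `extract_with_ocr_fallback`, and the 20-character threshold are hypothetical, and the `extract` argument is assumed to behave like `extract_bytes` (a fake is used here so the block runs standalone).

```python
import asyncio
from typing import NamedTuple


class FakeResult(NamedTuple):
    """Stand-in with the same two fields as ExtractionResult."""

    content: str
    mime_type: str


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Return True when the text has fewer than `min_chars` non-whitespace characters."""
    return len("".join(text.split())) < min_chars


async def extract_with_ocr_fallback(data: bytes, extract, mime_type: str = "application/pdf"):
    """Extract normally, then retry once with force_ocr=True if the text is sparse."""
    result = await extract(data, mime_type=mime_type)
    if needs_ocr(result.content):
        result = await extract(data, mime_type=mime_type, force_ocr=True)
    return result


async def fake_extract_bytes(data: bytes, mime_type: str, force_ocr: bool = False) -> FakeResult:
    # Hypothetical stand-in for kreuzberg.extract_bytes: it pretends the PDF is a
    # scan, so text only appears when OCR is forced.
    text = "Recognized text from the scanned page." if force_ocr else ""
    return FakeResult(content=text, mime_type="text/plain")


if __name__ == "__main__":
    result = asyncio.run(extract_with_ocr_fallback(b"%PDF-", fake_extract_bytes))
    print(result.content)
```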

### Error Handling

Kreuzberg raises two exception types:

#### ValidationError

Raised when there are issues with input validation:

- Unsupported mime types
- Undetectable mime types
- The path doesn't point to an existing file

#### ParsingError

Raised when there are issues during the text extraction process:

- PDF parsing failures
- OCR errors
- Pandoc conversion errors

```python
from kreuzberg import extract_file
from kreuzberg.exceptions import ValidationError, ParsingError


async def safe_extract():
    try:
        result = await extract_file("document.doc")
        return result.content
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        print(f"Context: {e.context}")
    except ParsingError as e:
        print(f"Parsing error: {e.message}")
        print(f"Context: {e.context}")  # Contains detailed error information
```

Both error types include helpful context information for debugging:

```python
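from kreuzberg import extract_file
from kreuzberg.exceptions import ParsingError


async def extract_scanned():
    try:
        return await extract_file("scanned.pdf")
    except ParsingError as e:
        # e.context might contain:
        # {
        #    "file_path": "scanned.pdf",
        #    "error": "Tesseract OCR failed: Unable to process image"
        # }
        print(f"Context: {e.context}")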
```

### ExtractionResult

All extraction functions return an `ExtractionResult` named tuple containing:

- `content`: The extracted text as a string
- `mime_type`: The mime type of the output (either "text/plain" or, if pandoc is used, "text/markdown")

```python
from kreuzberg import ExtractionResult, extract_file


async def process_document(path: str) -> str:
    result: ExtractionResult = await extract_file(path)
    return result.content


# or unpack the result as a tuple

async def process_document_unpacked(path: str) -> str:
    content, mime_type = await extract_file(path)
    # do something with mime_type
    return content
```
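
Because `mime_type` says whether the output is plain text or Markdown, it can drive how the result is stored. Here is a small sketch; `output_suffix` and `save_result` are hypothetical helpers, and only the two mime types listed above are handled:

```python
from pathlib import Path

SUFFIX_BY_MIME_TYPE = {
    "text/plain": ".txt",
    "text/markdown": ".md",
}


def output_suffix(mime_type: str) -> str:
    """Map an output mime type to a file suffix, defaulting to .txt."""
    return SUFFIX_BY_MIME_TYPE.get(mime_type, ".txt")


def save_result(content: str, mime_type: str, destination: Path) -> Path:
    """Write `content` to `destination`, with its suffix replaced to match the mime type."""
    target = destination.with_suffix(output_suffix(mime_type))
    target.write_text(content, encoding="utf-8")
    return target
```

For example, `save_result(content, mime_type, Path("report.pdf"))` would write `report.md` when Pandoc produced Markdown.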

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo
2. Install the system dependencies
3. Install the full dependencies with `uv sync`
4. Install the pre-commit hooks with:
   ```shell
   pre-commit install && pre-commit install --hook-type commit-msg
   ```
5. Make your changes and submit a PR

## License

This library uses the MIT license.
