Metadata-Version: 2.4
Name: kreuzberg
Version: 3.4.0
Summary: A text extraction library supporting PDFs, images, office documents and more
Project-URL: homepage, https://github.com/Goldziher/kreuzberg
Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
License: MIT
License-File: LICENSE
Keywords: document-processing,image-to-text,ocr,pandoc,pdf-extraction,rag,table-extraction,tesseract,text-extraction,text-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.13
Requires-Dist: anyio>=4.9.0
Requires-Dist: charset-normalizer>=3.4.2
Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
Requires-Dist: html-to-markdown>=1.4.0
Requires-Dist: msgspec>=0.18.0
Requires-Dist: playa-pdf>=0.6.1
Requires-Dist: psutil>=7.0.0
Requires-Dist: pypdfium2==4.30.0
Requires-Dist: python-calamine>=0.3.2
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.14.0; python_version < '3.12'
Provides-Extra: all
Requires-Dist: click>=8.2.1; extra == 'all'
Requires-Dist: easyocr>=1.7.2; extra == 'all'
Requires-Dist: gmft>=0.4.2; extra == 'all'
Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'all'
Requires-Dist: paddleocr>=3.1.0; extra == 'all'
Requires-Dist: paddlepaddle>=3.1.0; extra == 'all'
Requires-Dist: rich>=14.0.0; extra == 'all'
Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'all'
Requires-Dist: setuptools>=80.9.0; extra == 'all'
Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
Provides-Extra: api
Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.1.6; extra == 'api'
Provides-Extra: chunking
Requires-Dist: semantic-text-splitter>=0.27.0; extra == 'chunking'
Provides-Extra: cli
Requires-Dist: click>=8.2.1; extra == 'cli'
Requires-Dist: rich>=14.0.0; extra == 'cli'
Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
Provides-Extra: gmft
Requires-Dist: gmft>=0.4.2; extra == 'gmft'
Provides-Extra: paddleocr
Requires-Dist: paddleocr>=3.1.0; extra == 'paddleocr'
Requires-Dist: paddlepaddle>=3.1.0; extra == 'paddleocr'
Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
Description-Content-Type: text/markdown

# Kreuzberg

[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Kreuzberg is a **high-performance** Python library for text extraction from documents. **Benchmarked as one of the fastest text extraction libraries available**, it provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs optimized for speed and efficiency.

## Why Kreuzberg?

- **🚀 Substantially Faster**: Extraction speeds that significantly outperform other text extraction libraries
- **⚡ Unique Dual API**: The only framework supporting both sync and async APIs for maximum flexibility
- **💾 Memory Efficient**: Lower memory footprint compared to competing libraries
- **📊 Proven Performance**: [Comprehensive benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) demonstrate superior performance across formats
- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- **Local Processing**: No external API calls or cloud dependencies required
- **Resource Efficient**: Lightweight processing without GPU requirements
- **Format Support**: Comprehensive support for documents, images, and text formats
- **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
- **Command Line Interface**: Powerful CLI for batch processing and automation
- **Metadata Extraction**: Get document metadata alongside text content
- **Table Extraction**: Extract tables from documents using the excellent GMFT library
- **Modern Python**: Built with async/await, type hints, and a functional-first approach
- **Permissive OSS**: MIT licensed with permissively licensed dependencies

## Quick Start

```bash
pip install kreuzberg

# Or install with CLI support
pip install "kreuzberg[cli]"

# Or install with API server
pip install "kreuzberg[api]"
```

Install pandoc:

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install -y tesseract pandoc
```

The tesseract OCR engine is the default OCR engine. You can decide not to use it - and then either use one of the two alternative OCR engines, or have no OCR at all.

### Alternative OCR engines

```bash
# Install with EasyOCR support
pip install "kreuzberg[easyocr]"

# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"
```

## Quick Example

```python
import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)

asyncio.run(main())
```

## Docker

Docker images are available for easy deployment:

```bash
# Run the API server
docker run -p 8000:8000 goldziher/kreuzberg:latest

# Extract files via API
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
```

See the [Docker documentation](https://goldziher.github.io/kreuzberg/user-guide/docker/) for more options.

## REST API

Run Kreuzberg as a REST API server:

```bash
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run
```

See the [API documentation](https://goldziher.github.io/kreuzberg/user-guide/api-server/) for endpoints and usage.

## Command Line Interface

Kreuzberg includes a powerful CLI for processing documents from the command line:

```bash
# Extract text from a file
kreuzberg extract document.pdf

# Extract with JSON output and metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Extract from stdin
cat document.html | kreuzberg extract

# Use specific OCR backend
kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de

# Extract with configuration file
kreuzberg extract document.pdf --config config.toml
```

### CLI Configuration

Configure via `pyproject.toml`:

```toml
[tool.kreuzberg]
force_ocr = true
chunk_content = false
extract_tables = true
max_chars = 4000
ocr_backend = "tesseract"

[tool.kreuzberg.tesseract]
language = "eng+deu"
psm = 3
```

For full CLI documentation, see the [CLI Guide](https://goldziher.github.io/kreuzberg/cli/).

## Documentation

For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):

- [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
- [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line interface documentation
- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
- [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
- [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) - Choose the right OCR engine

## Supported Formats

Kreuzberg supports a wide range of document formats:

- **Documents**: PDF, DOCX, RTF, TXT, EPUB, etc.
- **Images**: JPG, PNG, TIFF, BMP, GIF, etc.
- **Spreadsheets**: XLSX, XLS, CSV, etc.
- **Presentations**: PPTX, PPT, etc.
- **Web Content**: HTML, XML, etc.

## OCR Engines

Kreuzberg supports multiple OCR engines:

- **Tesseract** (Default): Lightweight, fast startup, requires system installation
- **EasyOCR**: Good for many languages, pure Python, but downloads models on first use
- **PaddleOCR**: Excellent for Asian languages, pure Python, but downloads models on first use

For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.

## Performance

Kreuzberg delivers **exceptional performance** compared to other text extraction libraries:

### 🏆 Competitive Benchmarks

[Comprehensive benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) comparing Kreuzberg against other popular Python text extraction libraries show:

- **Fastest Extraction**: Consistently fastest processing times across file formats
- **Lowest Memory Usage**: Most memory-efficient text extraction solution
- **100% Success Rate**: Reliable extraction across all tested document types
- **Optimal for High-Throughput**: Designed for real-time, production applications

### 💾 Installation Size Efficiency

Kreuzberg delivers maximum performance with minimal overhead:

1. **Kreuzberg**: 71.0 MB (20 deps) - Most lightweight
1. **Unstructured**: 145.8 MB (54 deps) - Moderate footprint
1. **MarkItDown**: 250.7 MB (25 deps) - ML inference overhead
1. **Docling**: 1,031.9 MB (88 deps) - Full ML stack included

**Kreuzberg is up to 14x smaller** than competing solutions while delivering superior performance.

### ⚡ Sync vs Async Performance

Kreuzberg is the only library offering both sync and async APIs. Choose based on your use case:

| Operation              | Sync Time | Async Time | Async Advantage    |
| ---------------------- | --------- | ---------- | ------------------ |
| Simple text (Markdown) | 0.4ms     | 17.5ms     | **❌ 41x slower**  |
| HTML documents         | 1.6ms     | 1.1ms      | **✅ 1.5x faster** |
| Complex PDFs           | 39.0s     | 8.5s       | **✅ 4.6x faster** |
| OCR processing         | 0.4s      | 0.7s       | **✅ 1.7x faster** |
| Batch operations       | 38.6s     | 8.5s       | **✅ 4.5x faster** |

**Rule of thumb:** Use async for complex documents, OCR, batch processing, and backend APIs.

For detailed benchmarks and methodology, see our [Performance Documentation](https://goldziher.github.io/kreuzberg/advanced/performance/).

## Contributing

We welcome contributions! Please see our [Contributing Guide](docs/contributing.md) for details on setting up your development environment and submitting pull requests.

## License

This library is released under the MIT license.
