Metadata-Version: 2.4
Name: kreuzberg
Version: 3.18.0
Summary: Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
Project-URL: documentation, https://kreuzberg.dev
Project-URL: homepage, https://github.com/Goldziher/kreuzberg
Author-email: Na'aman Hirschfeld <nhirschfed@gmail.com>
License: MIT
License-File: LICENSE
Keywords: async,document-analysis,document-classification,document-intelligence,document-processing,extensible,information-extraction,mcp,metadata-extraction,model-context-protocol,ocr,pandoc,pdf-extraction,pdfium,plugin-architecture,rag,retrieval-augmented-generation,structured-data,table-extraction,tesseract,text-extraction
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
Classifier: Topic :: Office/Business :: Office Suites
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: anyio>=4.11.0
Requires-Dist: chardetng-py>=0.3.5
Requires-Dist: exceptiongroup>=1.2.2; python_version < '3.11'
Requires-Dist: html-to-markdown[lxml]>=1.16.0
Requires-Dist: langcodes>=3.5.0
Requires-Dist: mcp>=1.15.0
Requires-Dist: msgspec>=0.18.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: playa-pdf>=0.7.0
Requires-Dist: polars>=1.33.1
Requires-Dist: psutil>=7.1.0
Requires-Dist: pypdfium2==4.30.0
Requires-Dist: python-calamine>=0.5.3
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: typing-extensions>=4.15.0; python_version < '3.12'
Provides-Extra: additional-extensions
Requires-Dist: mailparse>=1.0.15; extra == 'additional-extensions'
Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'additional-extensions'
Provides-Extra: all
Requires-Dist: click>=8.2.1; extra == 'all'
Requires-Dist: deep-translator>=1.11.4; extra == 'all'
Requires-Dist: easyocr>=1.7.2; extra == 'all'
Requires-Dist: fast-langdetect>=1.0.0; extra == 'all'
Requires-Dist: gmft>=0.4.2; extra == 'all'
Requires-Dist: keybert>=0.9.0; extra == 'all'
Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.17.0; extra == 'all'
Requires-Dist: mailparse>=1.0.15; extra == 'all'
Requires-Dist: paddleocr>=3.2.0; extra == 'all'
Requires-Dist: paddlepaddle>=3.2.0; extra == 'all'
Requires-Dist: playa-pdf[crypto]>=0.7.0; extra == 'all'
Requires-Dist: rich>=14.1.0; extra == 'all'
Requires-Dist: semantic-text-splitter>=0.28.0; extra == 'all'
Requires-Dist: setuptools>=80.9.0; extra == 'all'
Requires-Dist: spacy>=3.8.7; extra == 'all'
Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'all'
Provides-Extra: api
Requires-Dist: litestar[opentelemetry,standard,structlog]>=2.17.0; extra == 'api'
Provides-Extra: chunking
Requires-Dist: semantic-text-splitter>=0.28.0; extra == 'chunking'
Provides-Extra: cli
Requires-Dist: click>=8.2.1; extra == 'cli'
Requires-Dist: rich>=14.1.0; extra == 'cli'
Requires-Dist: tomli>=2.0.0; (python_version < '3.11') and extra == 'cli'
Provides-Extra: crypto
Requires-Dist: playa-pdf[crypto]>=0.7.0; extra == 'crypto'
Provides-Extra: document-classification
Requires-Dist: deep-translator>=1.11.4; extra == 'document-classification'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7.2; extra == 'easyocr'
Provides-Extra: entity-extraction
Requires-Dist: keybert>=0.9.0; extra == 'entity-extraction'
Requires-Dist: spacy>=3.8.7; extra == 'entity-extraction'
Provides-Extra: gmft
Requires-Dist: gmft>=0.4.2; extra == 'gmft'
Provides-Extra: langdetect
Requires-Dist: fast-langdetect>=1.0.0; extra == 'langdetect'
Provides-Extra: paddleocr
Requires-Dist: paddleocr>=3.2.0; extra == 'paddleocr'
Requires-Dist: paddlepaddle>=3.2.0; extra == 'paddleocr'
Requires-Dist: setuptools>=80.9.0; extra == 'paddleocr'
Description-Content-Type: text/markdown

# Kreuzberg

[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
[![Benchmarks](https://img.shields.io/badge/benchmarks-fastest%20CPU-orange)](https://benchmarks.kreuzberg.dev/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![DeepSource](https://app.deepsource.com/gh/Goldziher/kreuzberg.svg/?label=code+coverage&show_trend=true&token=U8AW1VWWSLwVhrbtL8LmLBDN)](https://app.deepsource.com/gh/Goldziher/kreuzberg/)

**A document intelligence framework for Python.** Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 **[Complete Documentation](https://kreuzberg.dev/)**

## Framework Overview

### Document Intelligence Capabilities

- **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
- **Image Extraction**: Extract embedded images from PDFs, presentations, HTML, and Office documents with optional OCR
- **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
- **Format Support**: 21 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- **OCR Integration**: Tesseract OCR with markdown output (default) and table extraction from scanned documents
- **Document Classification**: Automatic document type detection (contracts, forms, invoices, receipts, reports)

### Technical Architecture

- **Performance**: Highest throughput among the benchmarked Python document processing frameworks (30+ files/second on tiny files in sync mode)
- **Resource Efficiency**: 71MB installation, ~360MB runtime memory footprint
- **Extensibility**: Plugin architecture for custom extractors via the Extractor base class
- **API Design**: Synchronous and asynchronous APIs with consistent interfaces
- **Type Safety**: Complete type annotations throughout the codebase

### Open Source Foundation

Kreuzberg leverages established open source technologies:

- **Pandoc**: Universal document converter for robust format support
- **PDFium**: Google's PDF rendering engine for accurate PDF processing
- **Tesseract**: Google's OCR engine for text recognition
- **Python-docx/pptx**: Native Microsoft Office format support

## Quick Start

### Extract Text with CLI

```bash
# Extract text from a supported file and save it as plain text
uvx kreuzberg extract document.pdf > output.txt

# Extract with OCR, using the Tesseract backend and plain-text output
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json
```

### Python Usage

**Async (recommended for web apps):**

```python
import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())
```
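
When several documents need to be processed in the same async context, the extraction calls can be awaited concurrently. The following is a minimal sketch, not part of Kreuzberg itself: `extract_many` and `max_concurrency` are hypothetical names chosen for illustration, and the extractor is passed in as an argument (for example `extract_file`) so the helper has no hard dependency on the library.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Any


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> list[Any]:
    """Run `extract` over `paths` concurrently, preserving input order.

    `extract` is expected to be an async callable such as kreuzberg's
    `extract_file`. `max_concurrency` is an illustrative limit for this
    sketch, not a Kreuzberg setting.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> Any:
        # Hold one semaphore slot for the duration of a single extraction.
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(bounded(path) for path in paths))
```

Inside an async function this could be called as `results = await extract_many(["a.pdf", "b.docx"], extract_file)`. The semaphore caps how many extractions run at once, which may help keep memory bounded on large batches.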

**Sync (for scripts and CLI tools):**

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
```

### Docker

Two optimized images are available:

```bash
# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg

# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
```
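
The `curl` call above uploads the document as a multipart form field named `file`. A rough Python equivalent using only the standard library might look like the sketch below. It assumes the server is reachable at `http://localhost:8000`; `build_extract_request` is a hypothetical helper name, and the response format is not covered here, so the sketch stops at building the request.

```python
import mimetypes
import urllib.request
import uuid


def build_extract_request(
    filename: str,
    data: bytes,
    url: str = "http://localhost:8000/extract",
) -> urllib.request.Request:
    """Build a multipart/form-data POST similar to `curl -F "file=@..."`."""
    boundary = uuid.uuid4().hex
    # Guess the part's MIME type from the file extension, with a generic fallback.
    mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
            f"Content-Type: {mime_type}\r\n\r\n".encode(),
            data,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

To send it, pass the returned request to `urllib.request.urlopen` and read the response body. Inspect the response before relying on specific fields, since its schema is documented in the API reference rather than here.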

📖 **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** • **[CLI Documentation](https://kreuzberg.dev/cli/)** • **[API Reference](https://kreuzberg.dev/api-reference/)**

## Deployment Options

### 🤖 MCP Server (AI Integration)

**Add to Claude Code with one command:**

```bash
claude mcp add kreuzberg uvx kreuzberg-mcp
```

**Or configure Claude Desktop manually in `claude_desktop_config.json`:**

```json
{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
```
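
If the configuration file already lists other MCP servers, the Kreuzberg entry has to be merged in rather than overwriting the file. A minimal sketch of that merge follows; `add_kreuzberg_server` is a hypothetical helper, and the `other` server in the example is made up for illustration.

```python
import json
from typing import Any


def add_kreuzberg_server(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `config` with the kreuzberg MCP entry added."""
    # Copy the existing server map so the caller's dict is left untouched.
    servers = dict(config.get("mcpServers", {}))
    servers["kreuzberg"] = {"command": "uvx", "args": ["kreuzberg-mcp"]}
    return {**config, "mcpServers": servers}


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_kreuzberg_server(existing), indent=2))
```

Loading the real file with `json.load`, passing it through the helper, and writing it back with `json.dump` would complete the round trip. The location of `claude_desktop_config.json` depends on the operating system.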

**MCP capabilities:**

- Extract text from PDFs, images, Office docs, and more
- Multilingual OCR support with Tesseract
- Metadata parsing and language detection

📖 **[MCP Documentation](https://kreuzberg.dev/user-guide/mcp-server/)**

## Supported Formats

| Category            | Formats                        |
| ------------------- | ------------------------------ |
| **Documents**       | PDF, DOCX, DOC, RTF, TXT, EPUB |
| **Images**          | JPG, PNG, TIFF, BMP, GIF, WEBP |
| **Spreadsheets**    | XLSX, XLS, CSV, ODS            |
| **Presentations**   | PPTX, PPT, ODP                 |
| **Web**             | HTML, XML, MHTML               |
| **Structured Data** | JSON, YAML, TOML               |
| **Archives**        | Support via extraction         |

## 📊 Performance Characteristics

[View comprehensive benchmarks](https://benchmarks.kreuzberg.dev/) • [Benchmark methodology](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://kreuzberg.dev/performance-analysis/)

### Technical Specifications

| Metric                       | Kreuzberg Sync | Kreuzberg Async | Benchmarked        |
| ---------------------------- | -------------- | --------------- | ------------------ |
| **Throughput (tiny files)**  | 31.78 files/s  | 23.94 files/s   | Highest throughput |
| **Throughput (small files)** | 8.91 files/s   | 9.31 files/s    | Highest throughput |
| **Memory footprint**         | 359.8 MB       | 395.2 MB        | Lowest usage       |
| **Installation size**        | 71 MB          | 71 MB           | Smallest size      |
| **Success rate**             | 100%           | 100%            | Perfect            |
| **Supported formats**        | 18             | 18              | Comprehensive      |
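
The throughput figures in the table can be turned into per-file latency and rough batch-time estimates. The sketch below is a back-of-the-envelope calculation only: it assumes throughput stays constant over a batch and that a workload resembles the benchmark's "tiny" or "small" file buckets. The helper names are illustrative.

```python
# Throughput figures (files per second) copied from the table above.
THROUGHPUT = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}


def ms_per_file(files_per_second: float) -> float:
    """Average time per file in milliseconds at a given throughput."""
    return 1000.0 / files_per_second


def seconds_for_batch(n_files: int, files_per_second: float) -> float:
    """Estimated wall-clock seconds for n_files, assuming constant throughput."""
    return n_files / files_per_second


for (size, mode), rate in THROUGHPUT.items():
    print(
        f"{size:>5} / {mode:<5}: {ms_per_file(rate):6.1f} ms per file, "
        f"{seconds_for_batch(1000, rate):6.1f} s per 1000 files"
    )
```

At the reported rates, a tiny file takes on the order of 30 ms in sync mode, while a small file takes a little over 100 ms in either mode.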

### Architecture Advantages

- **Native C extensions**: Built on PDFium and Tesseract for maximum performance
- **Async/await support**: True asynchronous processing with intelligent task scheduling
- **Memory efficiency**: Streaming architecture minimizes memory allocation
- **Process pooling**: Automatic multiprocessing for CPU-intensive operations
- **Optimized data flow**: Efficient data handling with minimal transformations

> **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

## Documentation

### Quick Links

- [Installation Guide](https://kreuzberg.dev/getting-started/installation/) - Setup and dependencies
- [User Guide](https://kreuzberg.dev/user-guide/) - Comprehensive usage guide
- [Performance Analysis](https://kreuzberg.dev/performance-analysis/) - Detailed benchmark results
- [API Reference](https://kreuzberg.dev/api-reference/) - Complete API documentation
- [Docker Guide](https://kreuzberg.dev/user-guide/docker/) - Container deployment
- [REST API](https://kreuzberg.dev/user-guide/api-server/) - HTTP endpoints
- [CLI Guide](https://kreuzberg.dev/cli/) - Command-line usage
- [OCR Configuration](https://kreuzberg.dev/user-guide/ocr-configuration/) - OCR engine setup

## License

MIT License - see [LICENSE](LICENSE) for details.
