Metadata-Version: 2.3
Name: pdfplucker
Version: 0.3.1
Summary: Docling wrapper for PDF parsing
Author: rafaelghiorzi
Author-email: rafael.ghiorzi@gmail.com
Requires-Python: >=3.12,<3.13
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Provides-Extra: gpu
Requires-Dist: PyMuPDF (>=1.25.5,<2.0.0)
Requires-Dist: docling (>=2.30.0,<3.0.0)
Requires-Dist: onnxruntime-gpu (>=1.21.0,<2.0.0) ; extra == "gpu"
Requires-Dist: rapidocr_onnxruntime (>=1.4.4,<2.0.0)
Project-URL: Bug Tracker, https://github.com/rafaelghiorzi/pdfplucker/issues
Project-URL: Repository, https://github.com/rafaelghiorzi/pdfplucker
Description-Content-Type: text/markdown

# PdfPlucker

[![PyPI version](https://badge.fury.io/py/pdfplucker.svg)](https://badge.fury.io/py/pdfplucker)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

PdfPlucker is a wrapper around the Docling library, designed for batch processing PDF files. It gives users fine-grained control over processing parameters and output configuration through a simple command-line interface.

## Features

- **Comprehensive Extraction**: Extract text, tables, and images from PDF files with high fidelity
- **Structured Outputs**: Get results in well-organized JSON and Markdown formats
- **High Performance**: Process multiple documents simultaneously with parallel processing
- **Hardware Acceleration**: Support for both CPU and CUDA for faster processing
- **Simple Interface**: Intuitive CLI commands for easy parameter control
- **Batch Processing**: Handle directories of PDFs effortlessly

## Installation

PdfPlucker requires Python 3.12 (the package metadata declares `>=3.12,<3.13`). To install it, run:

```bash
pip install pdfplucker
```

If you want GPU support, run:

```bash
pip install "pdfplucker[gpu]"
```
_Note: For GPU support, you may need to install the PyTorch build that matches your CUDA version._
_Check your CUDA version with `nvidia-smi` and see https://pytorch.org/get-started/locally/ for instructions._

Or install from source:

```bash
git clone https://github.com/rafaelghiorzi/pdfplucker.git
cd pdfplucker
pip install -r requirements.txt
```

## Requirements

- Python 3.12 (`>=3.12,<3.13`)
- For CUDA support: an NVIDIA GPU with up-to-date drivers
- Additional dependencies are automatically installed with the package

## Basic Usage

PdfPlucker has a built-in CLI to run the processor. The basic command structure is:

```bash
pdfplucker --source /path/to/pdf
```

This will process the PDF file and save the results to `./results` by default.

## Command-line Options

| Option | Description |
|--------|-------------|
| `-s, --source` | Path to PDF files (directory or single file) |
| `-o, --output` | Path to save processed information (default: `./results`) |
| `-f, --folder-separation` | Create separate folders for each PDF |
| `-i, --images` | Path to save extracted images (ignored if `--folder-separation` is active) |
| `-t, --timeout` | Time limit in seconds for processing each PDF (default: 600) |
| `-w, --workers` | Number of parallel processes (default: 4) |
| `-d, --device` | Processing device: CPU, CUDA, or AUTO (default: AUTO) |
| `-m, --markdown` | Export the document in an additional markdown file |
| `-ocr, --force-ocr` | Force text recognition using OCR, even for natively digital documents |
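
When PdfPlucker is driven from a script, the options above can be assembled into an argument list before the CLI is called. The following is a minimal sketch, not part of PdfPlucker: `build_pdfplucker_command` is a hypothetical helper, and only the long flags listed in the table are used.

```python
import shlex


def build_pdfplucker_command(
    source,
    output=None,
    workers=None,
    timeout=None,
    device=None,
    folder_separation=False,
    markdown=False,
    force_ocr=False,
):
    """Hypothetical helper: return the argv list for one pdfplucker run."""
    cmd = ["pdfplucker", "--source", str(source)]
    # Options that take a value are added only when explicitly set,
    # so the CLI defaults from the table still apply otherwise.
    if output is not None:
        cmd += ["--output", str(output)]
    if workers is not None:
        cmd += ["--workers", str(int(workers))]
    if timeout is not None:
        cmd += ["--timeout", str(int(timeout))]
    if device is not None:
        cmd += ["--device", str(device)]
    # Boolean switches are appended without a value.
    if folder_separation:
        cmd.append("--folder-separation")
    if markdown:
        cmd.append("--markdown")
    if force_ocr:
        cmd.append("--force-ocr")
    return cmd


cmd = build_pdfplucker_command(
    "./documents", output="./extracted_data", workers=2, markdown=True
)
print(shlex.join(cmd))
# To actually run it (requires pdfplucker on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.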

### Markdown Output

When enabled with the `--markdown` flag, PdfPlucker will generate a readable Markdown file that includes:
- Formatted document text
- Tables rendered in Markdown syntax
- Embedded images with base64 encoding
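
If the embedded images need to be recovered as files, they can be decoded back from the Markdown text. The sketch below assumes the images are embedded as data URIs of the form `![alt](data:image/<type>;base64,<payload>)`; that form is an assumption, so check an actual output file before relying on it. `extract_embedded_images` is a hypothetical helper.

```python
import base64
import re

# Assumed embedding form: ![alt](data:image/<type>;base64,<payload>)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\("
    r"data:image/(?P<kind>[A-Za-z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)"
    r"\)"
)


def extract_embedded_images(markdown_text):
    """Return a list of (image_type, raw_bytes), one per embedded image."""
    images = []
    for match in DATA_URI_IMAGE.finditer(markdown_text):
        raw = base64.b64decode(match.group("payload"))
        images.append((match.group("kind"), raw))
    return images


# Tiny synthetic document; the payload is placeholder bytes, not a real PNG.
payload = base64.b64encode(b"not a real png").decode("ascii")
sample = f"Intro text\n\n![Image](data:image/png;base64,{payload})\n"

found = extract_embedded_images(sample)
print(len(found), found[0][0])
```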

### Force OCR option

By default, Docling extracts the embedded text from natively digital PDFs. To force OCR on the file text instead, run the command with the `--force-ocr` flag.

### Number of workers

When processing a large number of files, note that a high worker count can lead to RAM shortages and memory leaks, especially when combined with forced OCR. Balance the number of workers against the available memory and processing power of your computer.
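
One rough way to pick a value for `--workers` is to cap it by both the CPU count and a per-worker memory budget. This is only a sketch: `suggest_workers` is a hypothetical helper, and the 4 GB per-worker figure is a placeholder assumption, not a measured PdfPlucker requirement.

```python
import os


def suggest_workers(available_gb, per_worker_gb=4.0, cpu_count=None):
    """Hypothetical helper: cap workers by CPU count and by a memory budget.

    per_worker_gb is a placeholder assumption; measure the real peak memory
    of one worker on your own documents and adjust it.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    by_memory = int(available_gb // per_worker_gb)
    # Always allow at least one worker, even on very small machines.
    return max(1, min(cpus, by_memory))


# 16 GB free, 4 GB per worker, 8 cores: memory is the limiting factor.
print(suggest_workers(available_gb=16, per_worker_gb=4, cpu_count=8))  # 4
```

With forced OCR, a larger `per_worker_gb` is the safer starting point.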

## Examples

### Process a single PDF file:

```bash
pdfplucker --source document.pdf
```

### Process all PDFs in a directory:

```bash
pdfplucker --source ./documents/ --output ./extracted_data
```

### Create separate folders for each PDF and include markdown output:

```bash
pdfplucker --source ./documents/ --folder-separation --markdown
```

### Specify output location for extracted images:

```bash
pdfplucker --source document.pdf --images ./images
```

### Use CUDA for processing with 8 workers:

```bash
pdfplucker --source ./documents/ --device CUDA --workers 8
```

## Advanced Usage

For processing large batches of PDFs, you can use the folder separation option combined with multiple workers:

```bash
pdfplucker --source ./pdf_collection/ --folder-separation --workers 8 --timeout 300 --force-ocr
```

This will create a separate folder for each PDF, use 8 parallel processes, set a timeout of 5 minutes per PDF, and force OCR for text recognition.

## Output Structure

PdfPlucker generates structured outputs in the following formats:

### JSON Output

The JSON output contains:
- Document metadata (title, author, date, etc.)
- Extracted text divided into sections (title, text)
- Table data with structure preserved, plus subtitles when present
- References to extracted images, plus subtitles when present

Example structure:
```json
{
    "metadata": {
        "format": "PDF 1.7",
        "title": "Microsoft Word - Sample Title",
        "..."
        "producer": "Microsoft: Print To PDF",
        "creationDate": "D:20250401144737-03'00'",
        "filename": "file.pdf"
    },
    "sections": [
        {
            "title": "Big Title!",
            "text": "Following text after title"
        }
    ],
    "images": [
      {
        "self_ref" : "#picture/1",
        "ref" : "path/to/image.png",
        "subtitle" : "possible subtitle"
      }
    ],
    "tables": [
      {
        "self_ref" : "#table/1",
        "subtitle" : "possible subtitle",
        "table" : {"table in dict format"}
      }
    ]
}
```
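
A result file with this layout can be read with the standard `json` module. The sketch below assumes the keys shown in the example structure (`metadata`, `sections`, `images`, `tables`). `summarize_result` and the sample data are hypothetical and only illustrate the shape.

```python
import json
import tempfile
from pathlib import Path


def summarize_result(json_path):
    """Hypothetical helper: pull a few fields out of one result file."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return {
        "filename": data.get("metadata", {}).get("filename"),
        "section_titles": [s.get("title") for s in data.get("sections", [])],
        "image_paths": [i.get("ref") for i in data.get("images", [])],
        "table_count": len(data.get("tables", [])),
    }


# Sample data mirroring the example structure; values are illustrative.
sample = {
    "metadata": {"format": "PDF 1.7", "filename": "file.pdf"},
    "sections": [{"title": "Big Title!", "text": "Following text after title"}],
    "images": [
        {"self_ref": "#picture/1", "ref": "path/to/image.png", "subtitle": "possible subtitle"}
    ],
    "tables": [{"self_ref": "#table/1", "subtitle": "possible subtitle", "table": {}}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "file.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_result(path)

print(summary)
```

Using `.get` with defaults keeps the helper tolerant of files where a key is absent.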

## Troubleshooting

### Common Issues

- **MemoryError**: Try reducing the number of workers or processing larger PDFs individually
- **CUDA not detected**: Ensure you have compatible NVIDIA drivers installed and visible to Python
- **Timeout errors**: Increase the timeout value for complex or large documents
- **Missing images**: Check file permissions in the output directory
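
For the "CUDA not detected" case, it can help to check what Python itself sees. This sketch assumes the GPU path goes through PyTorch and, with the `gpu` extra, `onnxruntime-gpu`. `cuda_report` is a hypothetical helper that returns `False` for any library that is not installed.

```python
def cuda_report():
    """Hypothetical helper: report whether CUDA is visible to each library."""
    report = {"torch": False, "onnxruntime": False}

    try:
        import torch

        report["torch"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # PyTorch not installed in this environment

    try:
        import onnxruntime

        providers = onnxruntime.get_available_providers()
        report["onnxruntime"] = "CUDAExecutionProvider" in providers
    except ImportError:
        pass  # onnxruntime not installed in this environment

    return report


print(cuda_report())
```

If both entries are `False` while `nvidia-smi` lists the GPU, the installed builds are likely CPU-only or mismatched with the local CUDA version.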

### Getting Help

If you encounter issues not covered here, please open an issue on GitHub with:
- The command you ran
- The error message
- Your system specifications (OS, Python version, etc.)

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please:

1. Check existing issues and pull requests
2. Fork the repository
3. Create a new branch for your feature
4. Add your changes
5. Submit a pull request

## Acknowledgments

- [Docling](https://github.com/docling-project/docling) for the core PDF processing capabilities
- All contributors and users of PdfPlucker
