Metadata-Version: 2.1
Name: hte
Version: 0.0.12
Summary: Extracting content from specific address books
Author: Eirik Berger
Author-email: eirik.berger@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: PyPDF2
Requires-Dist: pdf2image
Requires-Dist: pytesseract
Requires-Dist: tesseract
Requires-Dist: tqdm
Requires-Dist: scikit-image
Requires-Dist: opencv-python
Requires-Dist: pytest-shutil
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: pillow
Requires-Dist: importlib-metadata ; python_version == "3.8"

# historical-text-extraction (hte)
[![PyPI version](https://badge.fury.io/py/hte.svg)](https://badge.fury.io/py/hte)

A package for extracting text from historical documents. It is written for personal use.

## Installation

Install the current release from PyPI:

``` bash
pip install hte
```

Install the development version from [GitHub](https://github.com/) with:

``` bash
pip install git+ssh://git@github.com/eirikberger/hte.git
```
Note that this approach requires an SSH key set up for GitHub.

## Using it

Import the package:

``` python
from hte import digitize
```

The basic setup is the following:

``` python
# Create a Book instance
book = digitize.Book("data/finnmark_1968.pdf", "books")

# Run the processing steps on the instance
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
```
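
If several books need the same treatment, the steps above can be wrapped in a small helper. The sketch below is only an illustration: `run_pipeline` is a hypothetical name that is not part of `hte`, and it assumes the methods can simply be called in the order shown, with the same arguments as in the basic setup.

``` python
def run_pipeline(book, from_page, to_page, group_type="norway"):
    """Run the steps from the basic setup on an already created Book.

    `run_pipeline` is a hypothetical helper, not part of hte. It only
    repeats the calls and arguments shown in the basic setup.
    """
    book.CreateFolderStructure()
    book.PdfImport(page_info=False, from_page=from_page, to_page=to_page)
    book.Split(multiple_columns=True)
    book.RunOCR(type="splits", export_image=False)
    book.CombineCleanGroup(ocr_grouping=True, group_type=group_type)
    book.RegexStructure(group_type)
    return book


# Usage, assuming hte is installed and the PDF exists:
# from hte import digitize
# run_pipeline(digitize.Book("data/finnmark_1968.pdf", "books"), 21, 263)
```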

Make sure the correct Tesseract language pack is installed.

``` bash
# Check which languages are already installed
tesseract --list-langs

# List the languages available for installation
apt-cache search tesseract-ocr

# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
```
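
The same check can be done from Python before starting a long OCR run. This is a minimal sketch, not part of `hte`: `parse_langs` and `installed_langs` are hypothetical helpers, and the parsing assumes that `tesseract --list-langs` prints a header line starting with "List of" followed by one language code per line.

``` python
import shutil
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines()]
    # Drop empty lines and the "List of available languages ..." header line.
    return [line for line in lines if line and not line.startswith("List of")]


def installed_langs() -> List[str]:
    """Return the installed languages, or an empty list if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    if "nor" not in installed_langs():
        print("Norwegian language pack not found; install tesseract-ocr-nor")
```

If Tesseract prints warnings alongside the list, they would show up as extra entries, so the result is best used only for membership checks such as `"nor" in installed_langs()`.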
