Metadata-Version: 2.1
Name: pytesstrain
Version: 0.1.1
Summary: Collection of utilities for Tesseract OCR training
Home-page: https://github.com/wincentbalin/pytesstrain
Author: Wincent Balin
Author-email: wincent.balin+pytesstrain@gmail.com
License: Apache License (2.0)
Keywords: Tesseract,OCR,training
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Text Processing
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Requires-Dist: pytesseract
Requires-Dist: editdistance

# Python utilities for Tesseract OCR training

This module is a collection of training utilities for [Tesseract OCR](https://github.com/tesseract-ocr/tesseract).
The utilities are also available as console scripts, so they can be run from the command line.

## Utilities

All utilities list their command line switches when run with the switch `--help`.

* `rewrap` rewraps text lines to a specified maximum line length
* `create_dictdata` creates all word lists and n-gram lists from a text file, which are then converted to DAWGs and added to the traineddata file
* `language_metrics` creates random texts from a supplied word list and measures recognition error rates
* `collect_ambiguities` extracts error-correction pairs from reference-hypothesis pairs and stores them in a JSON file
* `json2unicharambigs` writes specified error-correction pairs from a JSON file to a unicharambigs file
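
To illustrate what an error-correction pair is, the following minimal sketch aligns a recognized text (hypothesis) with the expected text (reference) and keeps the substituted fragments. It is only an illustration built on the standard library `difflib`; the function name `extract_pairs` and the JSON layout are assumptions and do not describe how `collect_ambiguities` is actually implemented.

```python
import json
from difflib import SequenceMatcher


def extract_pairs(reference, hypothesis):
    """Return (error, correction) pairs for misrecognized fragments.

    Hypothetical helper for illustration; not part of pytesstrain.
    """
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    pairs = []
    for tag, h1, h2, r1, r2 in matcher.get_opcodes():
        # Keep substitutions only; pure insertions and deletions are skipped.
        if tag == "replace":
            pairs.append((hypothesis[h1:h2], reference[r1:r2]))
    return pairs


# "rn" was recognized as "m", a typical OCR confusion.
pairs = extract_pairs("modern", "modem")

# Assumed JSON layout: a list of [error, correction] lists.
print(json.dumps(pairs))
```

Pairs collected this way over many reference-hypothesis pairs show which confusions occur often enough to be worth an entry in a unicharambigs file.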


## Requirements

This module requires the following modules to work:

* pytesseract (Running Tesseract OCR)
* editdistance (Calculation of error rates)
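
As a rough illustration of how an edit distance yields error rates, the sketch below computes a character error rate and a word error rate. It uses a small pure-Python Levenshtein function so that it runs without extra dependencies; the `editdistance` package provides the same distance through `editdistance.eval()`. The function names and the handling of empty references are assumptions made for this example, not the API of this module.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        # Assumed convention: an empty reference is only matched by an empty hypothesis.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


reference = "the quick brown fox"
hypothesis = "the quick brovvn fox"

cer = error_rate(reference, hypothesis)                  # character error rate
wer = error_rate(reference.split(), hypothesis.split())  # word error rate
print(f"CER: {cer:.3f}, WER: {wer:.3f}")
```

Passing strings compares characters, while passing lists of words compares whole words, so the same function covers both rates.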

## Packages

The module is split into several packages. The package `pytesstrain.train` contains the workhorse function
`run_text()`. The package `pytesstrain.cli` contains the utilities you might run at the command line. The package
`pytesstrain.ambigs` contains functions for working with the `unicharambigs` file. The package `pytesstrain.text2image` contains
the interface to the `text2image` command from Tesseract OCR; the interface relies on the `pytesseract` module
and is modelled after it. The package `pytesstrain.metrics` contains error rate calculations as well as
the interface class `Metrics`. The package `pytesstrain.utils` contains auxiliary functions.


