Metadata-Version: 2.0
Name: DAWG-Python
Version: 0.7.2
Summary: Pure-python reader for DAWGs (DAFSAs) created by dawgdic C++ library or DAWG Python extension.
Home-page: https://github.com/kmike/DAWG-Python/
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic

DAWG-Python
===========

.. image:: https://travis-ci.org/kmike/DAWG-Python.png?branch=master
    :target: https://travis-ci.org/kmike/DAWG-Python
.. image:: https://coveralls.io/repos/kmike/DAWG-Python/badge.png?branch=master
    :target: https://coveralls.io/r/kmike/DAWG-Python


This pure-python package provides read-only access for files
created by `dawgdic`_ C++ library and `DAWG`_ python package.

.. _dawgdic: https://code.google.com/p/dawgdic/
.. _DAWG: https://github.com/kmike/DAWG

This package is not capable of creating DAWGs. It works with DAWGs built by
`dawgdic`_ C++ library or `DAWG`_ Python extension module. The main purpose
of DAWG-Python is to provide an access to DAWGs without requiring compiled
extensions. It is also quite fast under PyPy (see benchmarks).

Installation
============

pip install DAWG-Python

Usage
=====

The aim of DAWG-Python is to be API- and binary-compatible
with `DAWG`_ when it is possible.

First, you have to create a dawg using DAWG_ module::

    import dawg
    d = dawg.DAWG(data)
    d.save('words.dawg')

And then this dawg can be loaded without requiring C extensions::

    import dawg_python
    d = dawg_python.DAWG().load('words.dawg')

Please consult `DAWG`_ docs for detailed usage. Some features
(like constructor parameters or ``save`` method) are intentionally
unsupported.

Benchmarks
==========

Benchmark results (100k unicode words, integer values (lenghts of the words),
PyPy 1.9, macbook air i5 1.8 Ghz)::

    dict __getitem__ (hits):        11.090M ops/sec
    DAWG __getitem__ (hits):        not supported
    BytesDAWG __getitem__ (hits):   0.493M ops/sec
    RecordDAWG __getitem__ (hits):  0.376M ops/sec

    dict get() (hits):              10.127M ops/sec
    DAWG get() (hits):              not supported
    BytesDAWG get() (hits):         0.481M ops/sec
    RecordDAWG get() (hits):        0.402M ops/sec
    dict get() (misses):            14.885M ops/sec
    DAWG get() (misses):            not supported
    BytesDAWG get() (misses):       1.259M ops/sec
    RecordDAWG get() (misses):      1.337M ops/sec

    dict __contains__ (hits):           11.100M ops/sec
    DAWG __contains__ (hits):           1.317M ops/sec
    BytesDAWG __contains__ (hits):      1.107M ops/sec
    RecordDAWG __contains__ (hits):     1.095M ops/sec

    dict __contains__ (misses):         10.567M ops/sec
    DAWG __contains__ (misses):         1.902M ops/sec
    BytesDAWG __contains__ (misses):    1.873M ops/sec
    RecordDAWG __contains__ (misses):   1.862M ops/sec

    dict items():           44.401 ops/sec
    DAWG items():           not supported
    BytesDAWG items():      3.226 ops/sec
    RecordDAWG items():     2.987 ops/sec
    dict keys():            426.250 ops/sec
    DAWG keys():            not supported
    BytesDAWG keys():       6.050 ops/sec
    RecordDAWG keys():      6.363 ops/sec

    DAWG.prefixes (hits):    0.756M ops/sec
    DAWG.prefixes (mixed):   1.965M ops/sec
    DAWG.prefixes (misses):  1.773M ops/sec

    RecordDAWG.keys(prefix="xxx"), avg_len(res)==415:       1.429K ops/sec
    RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17:      36.994K ops/sec
    RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3:    121.897K ops/sec
    RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 265.015K ops/sec
    RecordDAWG.keys(prefix="xxx"), NON_EXISTING:            2450.898K ops/sec

Under CPython expect it to be about 50x slower.
Memory consumption of DAWG-Python should be the same as of `DAWG`_.

.. _marisa-trie: https://github.com/kmike/marisa-trie

Current limitations
===================

* This package is not capable of creating DAWGs;
* all the limitations of `DAWG`_ apply.

Contributions are welcome!


Contributing
============

Development happens at github: https://github.com/kmike/DAWG-Python
Issue tracker: https://github.com/kmike/DAWG-Python/issues

Feel free to submit ideas, bugs or pull requests.

Running tests and benchmarks
----------------------------

Make sure `tox`_ is installed and run

::

    $ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2, 3.3,
3.4 and PyPy >= 1.9.

In order to run benchmarks, type

::

    $ tox -c bench.ini -e pypy

This runs benchmarks under PyPy (they are about 50x slower under CPython).

.. _tox: http://tox.testrun.org

Authors & Contributors
----------------------

* Mikhail Korobov <kmike84@gmail.com>

The algorithms are from `dawgdic`_ C++ library by Susumu Yata & contributors.

License
=======

This package is licensed under MIT License.



Changes
=======

0.7.2 (2015-04-18)
------------------

- minor speedup;
- bitbucket mirror is no longer maintained.

0.7.1 (2014-06-05)
------------------

- Switch to setuptools;
- upload wheel tp pypi;
- check Python 3.4 compatibility.

0.7 (2013-10-13)
----------------

IntDAWG and IntCompletionDAWG are implemented.

0.6 (2013-03-23)
----------------

Use less shared state internally. This should fix thread-safety bugs and
make iterkeys/iteritems reenterant.

0.5.1 (2013-03-01)
------------------

Internal tweaks: memory usage is reduced; something is a bit faster,
something is a bit slower.

0.5 (2012-10-08)
----------------

Storage scheme is updated to match DAWG==0.5. This enables
the alphabetical ordering of ``BytesDAWG`` and ``RecordDAWG`` items.

In order to read ``BytesDAWG`` or ``RecordDAWG`` created with
versions of DAWG < 0.5 use ``payload_separator`` constructor argument::

    >>> BytesDAWG(payload_separator=b'\xff').load('old.dawg')


0.3.1 (2012-10-01)
------------------

Bug with empty DAWGs is fixed.

0.3 (2012-09-26)
----------------

- ``iterkeys`` and ``iteritems`` methods.

0.2 (2012-09-24)
----------------

``prefixes`` support.

0.1 (2012-09-20)
----------------

Initial release.


