Metadata-Version: 1.1
Name: marisa-trie
Version: 0.7.1
Summary: Static memory-efficient & fast Trie-like structures for Python (based on marisa-trie C++ library)
Home-page: https://github.com/kmike/marisa-trie/
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
License: UNKNOWN
Description: marisa-trie
        ===========
        
        .. image:: https://travis-ci.org/kmike/marisa-trie.png?branch=master
            :target: https://travis-ci.org/kmike/marisa-trie
        
        Static memory-efficient Trie-like structures for Python (2.x and 3.x).
        
        String data in a MARISA-trie may take up to 50x-100x less memory than
        in a standard Python dict; the raw lookup speed is comparable; trie also
        provides fast advanced methods like prefix search.
        
        Based on `marisa-trie`_ C++ library.
        
        .. note::
        
            There are official SWIG-based Python bindings included
            in the C++ library distribution; this package provides alternative
            Cython-based, pip-installable Python bindings.
        
        .. _marisa-trie: https://code.google.com/p/marisa-trie/
        
        Installation
        ============
        
        ::
        
            pip install marisa-trie
        
        Usage
        =====
        
        There are several Trie classes in this package:
        
        * ``marisa_trie.Trie`` - read-only trie-based data structure that maps
          unicode keys to auto-generated unique IDs;
        
        * ``marisa_trie.RecordTrie`` - read-only trie-based data structure that
          maps unicode keys to lists of data tuples. All tuples must be of the
          same format (the data is packed using python ``struct`` module).
        
        * ``marisa_trie.BytesTrie`` - read-only Trie that maps unicode
          keys to lists of ``bytes`` objects.
        
        
        marisa_trie.Trie
        ----------------
        
        Create a new trie::
        
            >>> import marisa_trie
            >>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
        
        Check if key is in trie::
        
            >>> u'key1' in trie
            True
            >>> u'key20' in trie
            False
        
        Each key is assigned a unique ID from 0 to (n - 1), where n is the
        number of keys::
        
            >>> trie.key_id(u'key2')
            1
            >>> trie[u'key2']  # alternative syntax
            1
        
        Note that you can't assign a value to a ``marisa_trie.Trie`` key,
        but can use the returned ID to store a value in a separate data structure
        (e.g. in a python list or numpy array).
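        
        A minimal sketch of that pattern follows. To keep it runnable without
        the package, a plain dict holding the IDs shown above stands in for the
        trie (an assumption made for illustration only); with a real
        ``marisa_trie.Trie`` you would use ``trie[key]`` or ``trie.key_id(key)``
        in place of ``key_ids[key]``. The ``lookup`` helper is hypothetical.
        
        ```python
        # Stand-in for the trie: the key -> ID mapping from the examples above.
        # With the real package, use trie[key] instead of key_ids[key].
        key_ids = {u'key1': 0, u'key2': 1, u'key12': 2}
        
        # IDs run from 0 to n - 1, so a list of length n holds one value per key.
        values = [None] * len(key_ids)
        for key, key_id in key_ids.items():
            values[key_id] = len(key)  # any per-key payload; here, the key length
        
        
        def lookup(key):
            """Return the value stored for key, found through the key's ID."""
            return values[key_ids[key]]
        
        
        print(lookup(u'key12'))
        ```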
        
        A key can be reconstructed from its ID::
        
            >>> trie.restore_key(1)
            u'key2'
        
        Find all prefixes of a given key::
        
            >>> trie.prefixes(u'key12')
            [u'key1', u'key12']
        
        There is also a generator version of the ``.prefixes`` method,
        called ``.iter_prefixes``.
        
        Find all keys from this trie that start with a given prefix::
        
            >>> trie.keys(u'key1')
            [u'key1', u'key12']
        
        (iterator version ``.iterkeys(prefix)`` is also available).
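        
        The two prefix operations are easy to confuse, so here is a pure-Python
        reference for what each one returns. It is only a sketch of the
        semantics using a linear scan over a list; the function names are
        hypothetical and it says nothing about how the library performs the
        search.
        
        ```python
        def prefixes_of(keys, query):
            """Stored keys that are prefixes of query (what .prefixes finds)."""
            return [k for k in keys if query.startswith(k)]
        
        
        def keys_with_prefix(keys, prefix):
            """Stored keys that start with prefix (what .keys(prefix) finds)."""
            return [k for k in keys if k.startswith(prefix)]
        
        
        stored = [u'key1', u'key2', u'key12']
        print(prefixes_of(stored, u'key12'))      # keys that u'key12' begins with
        print(keys_with_prefix(stored, u'key1'))  # keys that begin with u'key1'
        ```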
        
        Use ``items()`` method to return all (key, ID) pairs::
        
            >>> trie.items()
            [(u'key1', 0), (u'key12', 2), (u'key2', 1)]
        
        Filter them by prefix::
        
            >>> trie.items(u'key1')
            [(u'key1', 0), (u'key12', 2)]
        
        (iterator version ``.iteritems(prefix)`` is also available).
        
        marisa_trie.RecordTrie
        ----------------------
        
        Create a new trie::
        
            >>> keys = [u'foo', u'bar', u'foobar', u'foo']
            >>> values = [(1, 2), (2, 1), (3, 3), (2, 1)]
            >>> fmt = "<HH"   # a tuple with 2 short integers
            >>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))
        
        Trie initial data must be an iterable of tuples ``(unicode_key, data_tuple)``.
        Data tuples will be converted to bytes with ``struct.pack(fmt, *data_tuple)``.
        
        Take a look at http://docs.python.org/library/struct.html#format-strings
        for the format string specification.
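        
        To see what a format string implies for storage, the standard
        ``struct`` module can be exercised directly. This sketch only uses
        ``struct`` itself; that ``RecordTrie`` unpacks stored values with the
        same format on lookup is an assumption based on the description above.
        
        ```python
        import struct
        
        fmt = "<HH"  # little-endian, two unsigned 16-bit integers
        data_tuple = (1, 2)
        
        packed = struct.pack(fmt, *data_tuple)  # bytes that represent the tuple
        restored = struct.unpack(fmt, packed)   # tuple recovered from the bytes
        record_size = struct.calcsize(fmt)      # size in bytes of one record
        
        print(record_size)
        print(restored)
        ```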
        
        Duplicate keys are allowed.
        
        Check if key is in trie::
        
            >>> u'foo' in trie
            True
            >>> u'spam' in trie
            False
        
        Get a values list::
        
            >>> trie[u'bar']
            [(2, 1)]
            >>> trie[u'foo']
            [(1, 2), (2, 1)]
            >>> trie.get(u'bar', 123)
            [(2, 1)]
            >>> trie.get(u'BAAR', 123) # default value
            123
        
        
        Find all prefixes of a given key::
        
            >>> trie.prefixes(u'foobarz')
            [u'foo', u'foobar']
        
        Test whether some key begins with a given prefix::
        
            >>> trie.has_keys_with_prefix(u'fo')
            True
            >>> trie.has_keys_with_prefix(u'go')
            False
        
        Find all keys from this trie that start with a given prefix::
        
            >>> trie.keys(u'fo')
            [u'foo', u'foo', u'foobar']
        
        Find all items from this trie whose keys start with a given prefix::
        
            >>> trie.items(u'fo')
            [(u'foo', (1, 2)), (u'foo', (2, 1)), (u'foobar', (3, 3))]
        
        
        .. note::
        
            Iterator versions of ``.keys()`` and ``.items()`` are not implemented yet.
        
        marisa_trie.BytesTrie
        ---------------------
        
        ``BytesTrie`` is similar to ``RecordTrie``, but the values are raw bytes,
        not tuples::
        
            >>> keys = [u'foo', u'bar', u'foobar', u'foo']
            >>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2']
            >>> trie = marisa_trie.BytesTrie(zip(keys, values))
            >>> trie[u'bar']
            [b'bar-value']
        
        
        Persistence
        -----------
        
        Trie objects support saving/loading, pickling/unpickling
        and memory mapped I/O.
        
        Write trie to a stream::
        
            >>> with open('my_trie.marisa', 'wb') as f:
            ...     trie.write(f)
        
        Save trie to a file::
        
            >>> trie.save('my_trie_copy.marisa')
        
        Read trie from stream::
        
            >>> trie2 = marisa_trie.Trie()
            >>> with open('my_trie.marisa', 'rb') as f:
            ...     trie2.read(f)
        
        
        Load trie from file::
        
            >>> trie2.load('my_trie.marisa')
        
        Trie objects are picklable::
        
            >>> import pickle
            >>> data = pickle.dumps(trie)
            >>> trie3 = pickle.loads(data)
        
        You may also build a trie using the ``marisa-build`` command-line
        utility (provided by the underlying C++ library; it should be downloaded and
        compiled separately) and then load the trie from the resulting file
        using the ``.load()`` method.
        
        Memory mapped I/O
        -----------------
        
        It is possible to use a memory-mapped file as the data source::
        
            >>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')
        
        This way the whole dictionary won't be loaded to memory; memory
        mapped I/O is an easy way to share dictionary data among processes.
        
        .. warning::
        
            Memory mapped trie might cause a lot of random disk accesses which
            considerably increase the search time.
        
        Trie storage options
        --------------------
        
        `marisa-trie`_ C++ library provides some configuration options for trie storage;
        check http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html page
        (scroll down to "Enumeration Constants" section) to get an idea.
        
        These options are exposed as ``order``, ``num_tries``, ``cache_size``
        and ``binary`` keyword arguments for trie constructors.
        
        For example, set ``order`` to ``marisa_trie.LABEL_ORDER`` in order to
        make trie functions return results in alphabetical order::
        
            >>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)
        
        Note that two tries constructed from identical data but with different ``order``
        arguments will compare unequal::
        
            >>> t1 = marisa_trie.Trie(order=marisa_trie.LABEL_ORDER)
            >>> t2 = marisa_trie.Trie(order=marisa_trie.WEIGHT_ORDER)
            >>> t1 == t2
            False
        
        
        Benchmarks
        ==========
        
        My quick tests show that memory usage is quite decent.
        For a list of 3000000 (3 million) Russian words memory consumption
        with different data structures (under Python 2.7):
        
        * dict(unicode words -> word lengths): about 600M
        * list(unicode words) : about 300M
        * BaseTrie from datrie_ library: about 70M
        * ``marisa_trie.RecordTrie`` : 11M
        * ``marisa_trie.Trie``: 7M
        
        
        .. note::
        
            Lengths of words were stored as values in ``datrie.BaseTrie``
            and ``marisa_trie.RecordTrie``. ``RecordTrie`` compresses
            similar values and the key compression is better so it uses
            much less memory than ``datrie.BaseTrie``.
        
            ``marisa_trie.Trie`` provides auto-assigned IDs. It is not possible
            to store arbitrary values in ``marisa_trie.Trie`` so it uses less
            memory than ``RecordTrie``.
        
        Benchmark results (100k unicode words, integer values (lengths of the words),
        Python 3.2, MacBook Air, i5 1.8 GHz)::
        
            dict building                     2.919M words/sec
            Trie building                     0.394M words/sec
            BytesTrie building                0.355M words/sec
            RecordTrie building               0.354M words/sec
        
            dict __getitem__ (hits)           8.239M ops/sec
            Trie __getitem__ (hits)           not supported
            BytesTrie __getitem__ (hits)      0.498M ops/sec
            RecordTrie __getitem__ (hits)     0.404M ops/sec
        
            dict get() (hits)                 4.410M ops/sec
            Trie get() (hits)                 not supported
            BytesTrie get() (hits)            0.458M ops/sec
            RecordTrie get() (hits)           0.364M ops/sec
            dict get() (misses)               4.869M ops/sec
            Trie get() (misses)               not supported
            BytesTrie get() (misses)          0.849M ops/sec
            RecordTrie get() (misses)         0.816M ops/sec
        
            dict __contains__ (hits)          8.053M ops/sec
            Trie __contains__ (hits)          1.018M ops/sec
            BytesTrie __contains__ (hits)     0.605M ops/sec
            RecordTrie __contains__ (hits)    0.618M ops/sec
            dict __contains__ (misses)        6.489M ops/sec
            Trie __contains__ (misses)        2.047M ops/sec
            BytesTrie __contains__ (misses)   1.079M ops/sec
            RecordTrie __contains__ (misses)  1.123M ops/sec
        
            dict items()                      57.248 ops/sec
            Trie items()                      not supported
            BytesTrie items()                 11.691 ops/sec
            RecordTrie items()                8.369 ops/sec
        
            dict keys()                       217.920 ops/sec
            Trie keys()                       19.589 ops/sec
            BytesTrie keys()                  14.849 ops/sec
            RecordTrie keys()                 15.369 ops/sec
        
            Trie.prefixes (hits)              0.594M ops/sec
            Trie.prefixes (mixed)             1.874M ops/sec
            Trie.prefixes (misses)            1.447M ops/sec
            RecordTrie.prefixes (hits)        0.103M ops/sec
            RecordTrie.prefixes (mixed)       0.458M ops/sec
            RecordTrie.prefixes (misses)      0.164M ops/sec
            Trie.iter_prefixes (hits)         0.588M ops/sec
            Trie.iter_prefixes (mixed)        1.470M ops/sec
            Trie.iter_prefixes (misses)       1.170M ops/sec
        
            Trie.keys(prefix="xxx"), avg_len(res)==415                   5.044K ops/sec
            Trie.keys(prefix="xxxxx"), avg_len(res)==17                  89.363K ops/sec
            Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3                258.732K ops/sec
            Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4             293.199K ops/sec
            Trie.keys(prefix="xxx"), NON_EXISTING                        1169.524K ops/sec
        
            RecordTrie.keys(prefix="xxx"), avg_len(res)==415             3.836K ops/sec
            RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17            73.591K ops/sec
            RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3          229.515K ops/sec
            RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4       269.228K ops/sec
            RecordTrie.keys(prefix="xxx"), NON_EXISTING                  1071.433K ops/sec
        
        
        Tries from ``marisa_trie`` are static and use less memory; tries from
        `datrie`_ are faster and can be updated.
        
        You may also give DAWG_ a try - it is usually faster than
        ``marisa-trie`` and sometimes can use less memory (depending on data).
        
        Please take these benchmark results with a grain of salt; this
        is a very simple benchmark on a single data set.
        
        .. _datrie: https://github.com/kmike/datrie
        .. _DAWG: https://github.com/kmike/DAWG
        
        Current limitations
        ===================
        
        * The library is not tested with mingw32 compiler;
        * ``.prefixes()`` method of ``BytesTrie`` and ``RecordTrie`` is quite slow
          and doesn't have an iterator counterpart;
        * ``read()`` and ``write()`` methods don't work with file-like objects
          (they work only with real files; pickling works fine for file-like objects);
        * there are ``keys()`` and ``items()`` methods but no ``values()`` method.
        
        Contributions are welcome!
        
        Contributing
        ============
        
        Development happens at github: https://github.com/kmike/marisa-trie
        
        Feel free to submit ideas, bug reports and pull requests.
        
        If you find a bug in the C++ part, please report it to the original
        `bug tracker <https://code.google.com/p/marisa-trie/issues/list>`_.
        
        How the source code is organized
        --------------------------------
        
        There are 4 folders in the repository:
        
        * ``bench`` - benchmarks & benchmark data;
        * ``lib`` - original unmodified `marisa-trie`_ C++ library which is bundled
          for easier distribution; if something has to be fixed in this library,
          consider fixing it in the `original repo <https://code.google.com/p/marisa-trie/>`_ ;
        * ``src`` - wrapper code; ``src/marisa_trie.pyx`` is a wrapper implementation;
          ``src/*.pxd`` files are Cython headers for corresponding C++ headers;
          ``src/*.cpp`` files are the pre-built extension code and shouldn't be
          modified directly (they should be updated via ``update_cpp.sh`` script).
        * ``tests`` - the test suite.
        
        
        Running tests and benchmarks
        ----------------------------
        
        Make sure `tox`_ is installed and run
        
        ::
        
            $ tox
        
        from the source checkout. Tests should pass under python 2.6, 2.7,
        3.2, 3.3 and 3.4.
        
        In order to run benchmarks, type
        
        ::
        
            $ tox -c bench.ini
        
        
        .. _cython: http://cython.org
        .. _tox: http://tox.testrun.org
        
        Authors & Contributors
        ----------------------
        
        * Mikhail Korobov <kmike84@gmail.com>
        * `Matt Hickford <https://github.com/matt-hickford>`_
        * Sergei Lebedev <superbobry@gmail.com>
        
        This module is based on `marisa-trie`_ C++ library by
        Susumu Yata & contributors.
        
        License
        =======
        
        Wrapper code is licensed under MIT License.
        Bundled `marisa-trie`_ C++ library is dual-licensed under
        LGPL and BSD 2-clause license.
        
        
        
        CHANGES
        =======
        
        0.7.1 (2015-04-21)
        ------------------
        
        * setup.py is switched to setuptools;
        * a tiny speedup;
        * wrapper is rebuilt with Cython 0.22.
        
        0.7 (2014-12-15)
        ----------------
        
        * ``trie1 == trie2`` and ``trie1 != trie2`` now work (thanks Sergei Lebedev);
        * ``for key in trie:`` is fixed (thanks Sergei Lebedev);
        * wrapper is rebuilt with Cython 0.21.1 (thanks Sergei Lebedev);
        * https://bitbucket.org/kmike/marisa-trie repo is no longer supported.
        
        0.6 (2014-02-22)
        ----------------
        
        * New ``Trie`` methods: ``__getitem__``, ``get``, ``items``, ``iteritems``.
          ``trie[u'key']`` is now the same as ``trie.key_id(u'key')``.
        * small optimization for ``BytesTrie.get``.
        * wrapper is rebuilt with Cython 0.20.1.
        
        0.5.3 (2014-02-08)
        ------------------
        
        * small ``Trie.restore_key`` optimization (it should work 5-15% faster)
        
        0.5.2 (2014-02-08)
        ------------------
        
        * fix ``Trie.restore_key`` method - it was reading past declared string length;
        * rebuild wrapper with Cython 0.20.
        
        0.5.1 (2013-10-03)
        ------------------
        
        * ``has_keys_with_prefix(prefix)`` method (thanks
          `Matt Hickford <https://github.com/matt-hickford>`_)
        
        0.5 (2013-05-07)
        ----------------
        
        * ``BytesTrie.iterkeys``, ``BytesTrie.iteritems``,
          ``RecordTrie.iterkeys`` and ``RecordTrie.iteritems`` methods;
        * wrapper is rebuilt with Cython 0.19;
        * ``value_separator`` parameter for ``BytesTrie`` and ``RecordTrie``.
        
        0.4 (2013-02-28)
        ----------------
        
        * improved trie building: ``weights`` optional parameter;
        * improved trie building: unnecessary input sorting is removed;
        * wrapper is rebuilt with Cython 0.18;
        * bundled marisa-trie C++ library is updated to svn r133.
        
        0.3.8 (2013-01-03)
        ------------------
        
        * Rebuild wrapper with Cython pre-0.18;
        * update benchmarks.
        
        0.3.7 (2012-09-21)
        ------------------
        
        * Update bundled marisa-trie C++ library (this may fix more mingw issues);
        * Python 3.3 support is back.
        
        0.3.6 (2012-09-05)
        ------------------
        
        * much faster (3x-7x) ``.items()`` and ``.keys()`` methods for all tries;
          faster (up to 3x) ``.prefixes()`` method for ``Trie``.
        
        0.3.5 (2012-08-30)
        ------------------
        
        * Pickling of RecordTrie is fixed (thanks lazarou for the report);
        * error messages should become more useful.
        
        0.3.4 (2012-08-29)
        ------------------
        
        * Issues with mingw32 should be resolved (thanks Susumu Yata).
        
        0.3.3 (2012-08-27)
        ------------------
        
        * ``.get(key, default=None)`` method for ``BytesTrie`` and ``RecordTrie``;
        * small README improvements.
        
        0.3.2 (2012-08-26)
        ------------------
        
        * Small code cleanup;
        * ``load``, ``read`` and ``mmap`` methods return 'self';
        * I can't run tests (via tox) under Python 3.3 so it is
          removed from supported versions for now.
        
        0.3.1 (2012-08-23)
        ------------------
        
        * ``.prefixes()`` support for RecordTrie and BytesTrie.
        
        0.3 (2012-08-23)
        ----------------
        
        * RecordTrie and BytesTrie are introduced;
        * IntTrie class is removed (probably temporary?);
        * dumps/loads methods are renamed to tobytes/frombytes;
        * benchmark & tests improvements;
        * support for MARISA-trie config options is added.
        
        0.2 (2012-08-19)
        ------------------
        
        * Pickling/unpickling support;
        * dumps/loads methods;
        * python 3.3 workaround;
        * improved tests;
        * benchmarks.
        
        0.1 (2012-08-17)
        ----------------
        
        Initial release.
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
