Metadata-Version: 2.1
Name: dafsa
Version: 0.4
Summary: Library for computing Deterministic Acyclic Finite State Automata (DAFSA)
Home-page: https://github.com/tresoldi/dafsa
Author: Tiago Tresoldi
Author-email: tresoldi@shh.mpg.de
License: MIT
Project-URL: Documentation, https://dafsa.readthedocs.io
Keywords: dafsa,dawg,finite state,deterministic acyclic finite state automaton,directed acyclic word graph
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Description-Content-Type: text/markdown
Requires-Dist: networkx

# DAFSA

[![PyPI](https://img.shields.io/pypi/v/dafsa.svg)](https://pypi.org/project/dafsa)
[![Build Status](https://travis-ci.org/tresoldi/dafsa.svg?branch=master)](https://travis-ci.org/tresoldi/dafsa)
[![codecov](https://codecov.io/gh/tresoldi/dafsa/branch/master/graph/badge.svg)](https://codecov.io/gh/tresoldi/dafsa)
[![Codacy
Badge](https://api.codacy.com/project/badge/Grade/a2b47483ff684590b1208dbb4bbfc3ee)](https://www.codacy.com/manual/tresoldi/dafsa?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=tresoldi/dafsa&amp;utm_campaign=Badge_Grade)
[![Documentation
Status](https://readthedocs.org/projects/dafsa/badge/?version=latest)](https://dafsa.readthedocs.io/en/latest/?badge=latest)

DAFSA is a library for computing [Deterministic Acyclic Finite State Automata](https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton) (also known as "directed acyclic word graphs", or DAWG). DAFSAs are data structures derived from [tries](https://en.wikipedia.org/wiki/Trie) that represent a set of sequences (typically character strings or *n*-grams) as a directed acyclic graph with a single source vertex (the `start` symbol of all sequences) and at least one sink vertex (`end` symbols, each pointed to by one or more sequences). In the current implementation, a trait of each node expresses whether it can be used as a sink.

The primary difference between DAFSAs and tries is that the former eliminate suffix and infix redundancy, as in the example of Figure 1 (from the linked Wikipedia article), which stores the set of strings `"tap"`, `"taps"`, `"top"`, and `"tops"`. Even though DAFSAs cannot be used to store precise frequency information, given that multiple paths can reach the same terminal node, they still make it possible to estimate the sampling frequency; being acyclic, they can also reject any sequence not included in the training data. Fuzzy extensions will make it possible to estimate the sampling probability of unobserved sequences.

![Trie vs. DAFSA](https://raw.githubusercontent.com/tresoldi/dafsa/master/figures/trie-vs-dafsa.png)
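
The saving illustrated in the figure can be reproduced with a few lines of plain Python. The sketch below does not use the `dafsa` API and is not the algorithm the library implements: it builds a naive trie from nested dictionaries and then counts how many nodes remain when identical subtrees are merged. All function names are hypothetical and chosen for illustration only.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
        node[None] = True
    return root


def count_trie_nodes(node):
    """Count the nodes of the trie, including the root."""
    children = (child for symbol, child in node.items() if symbol is not None)
    return 1 + sum(count_trie_nodes(child) for child in children)


def count_merged_nodes(root):
    """Count the nodes left once identical subtrees are merged."""
    registry = {}  # subtree signature -> identifier of the shared node

    def identify(node):
        # Two nodes are equivalent when they agree on finality and on
        # their outgoing (symbol, target) pairs.
        edges = tuple(sorted(
            (symbol, identify(child))
            for symbol, child in node.items()
            if symbol is not None
        ))
        signature = (None in node, edges)
        return registry.setdefault(signature, len(registry))

    identify(root)
    return len(registry)


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
print(count_trie_nodes(trie))    # 8
print(count_merged_nodes(trie))  # 5
```

In this representation, where finality is a flag on the node, the trie needs eight nodes while the merged graph needs five, because the `a` and `o` branches share everything that follows them.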

This data structure is a special case of a finite state recognizer that acts as a deterministic finite state automaton, as it recognizes all and only the sequences it was built upon. DAFSAs are frequently used in computer science for the space-efficient storage of sets of sequences without common compression techniques, such as dictionary or entropy types, or without probabilistic data structures, such as Bloom filters. The automata generated by this library are intended for linguistic exploration, and extend published models by carrying information on the weight of each graph edge, which makes it possible to approximate the probability of random observation.
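
As a rough illustration of what edge weights can express, the sketch below counts how many inserted sequences traverse each edge of an unminimized trie, identifying an edge by the prefix it leaves from and the symbol it consumes. It is a simplification under stated assumptions: it does not call the `dafsa` API, the function names are hypothetical, and the way the library itself stores or normalizes weights may differ.

```python
from collections import Counter


def edge_weights(sequences):
    """Count how many sequences traverse each (prefix, symbol) edge."""
    weights = Counter()
    for sequence in sequences:
        for index, symbol in enumerate(sequence):
            weights[(sequence[:index], symbol)] += 1
    return weights


def transition_share(weights, prefix, symbol):
    """Share of the sequences continuing past `prefix` that take `symbol`."""
    total = sum(w for (p, _), w in weights.items() if p == prefix)
    return weights[(prefix, symbol)] / total if total else 0.0


weights = edge_weights(["dib", "tip", "tips", "top"])
print(weights[("", "t")])                  # 3
print(weights[("t", "i")])                 # 2
print(transition_share(weights, "", "t"))  # 0.75
```

Dividing an edge count by the total leaving the same node gives a simple relative-frequency estimate; sequences that end at the node are not part of that total in this sketch.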

Full documentation is available [on ReadTheDocs.io](https://dafsa.readthedocs.io),
including detailed instructions on
[how to use the library](https://dafsa.readthedocs.io/en/latest/quickstart.html).

## Installation and usage

For full instructions on installation and usage, please check the
[official documentation](https://dafsa.readthedocs.io).

In short, the library can be installed as any standard Python library with
`pip` (`pip install dafsa`), and used as demonstrated in the following snippet:

```python
>>> from dafsa import DAFSA
>>> print(DAFSA(["dib", "tip", "tips", "top"]))
DAFSA with 8 nodes and 9 edges (4 inserted sequences)
  +-- #0: 0(#1/4:<d>/1|#4/4:<t>/3) [('t', 4), ('d', 1)]
  +-- #1: n(#2/1:<i>/1) [('i', 2)]
  +-- #2: n(#3/1:<b>/1) [('b', 3)]
  +-- #3: F() []
  +-- #4: n(#5/3:<i>/2|#8/3:<o>/1) [('i', 5), ('o', 8)]
  +-- #5: n(#6/2:<p>/2) [('p', 6)]
  +-- #6: F(#3/2:<s>/1) [('s', 3)]
  +-- #8: n(#3/1:<p>/1) [('p', 3)]
```

## Showcase

* Basic example

![First example](https://raw.githubusercontent.com/tresoldi/dafsa/master/figures/example.png)

* Graphical, ASCII, and Unicode (through third-party applications)

![DNA example](https://raw.githubusercontent.com/tresoldi/dafsa/master/figures/dna.png)

```
                                   G                                A
                               +---------------------+          +----------+
                               |                     v          |          v
      #====#  C   +---+  G   +---+  C   +---+  G   +---+  A   +---+  T   +---+  A   #===#
  +-- H 0  H ---> | 5 | ---> | 6 | ---> | 7 | ---> | 8 | ---> | 9 | ---> | 3 | ---> H 4 H
  |   #====#      +---+      +---+      +---+      +---+      +---+      +---+      #===#
  |     |    A                                                             ^
  | G   +-----------+                                                      |
  |                 v                                                      |
  |   +----+  G   +---+  A   +---+  T                                      |
  +-> | 20 | ---> | 1 | ---> | 2 | ----------------------------------------+
      +----+      +---+      +---+
```

Or as Unicode box art:

```
                                   G                                A
                               ┌─────────────────────┐          ┌──────────┐
                               │                     ▼          │          ▼
      ╔════╗  C   ┌───┐  G   ┌───┐  C   ┌───┐  G   ┌───┐  A   ┌───┐  T   ┌───┐  A   ╔═══╗
  ┌── ║ 0  ║ ───▶ │ 5 │ ───▶ │ 6 │ ───▶ │ 7 │ ───▶ │ 8 │ ───▶ │ 9 │ ───▶ │ 3 │ ───▶ ║ 4 ║
  │   ╚════╝      └───┘      └───┘      └───┘      └───┘      └───┘      └───┘      ╚═══╝
  │     │    A                                                             ▲
  │ G   └───────────┐                                                      │
  │                 ▼                                                      │
  │   ┌────┐  G   ┌───┐  A   ┌───┐  T                                      │
  └─▶ │ 20 │ ───▶ │ 1 │ ───▶ │ 2 │ ────────────────────────────────────────┘
      └────┘      └───┘      └───┘
```

* Without or with single-path joining

![Phoneme example](https://raw.githubusercontent.com/tresoldi/dafsa/master/figures/phonemes.png)

![Reduced Phoneme example](https://raw.githubusercontent.com/tresoldi/dafsa/master/figures/reduced_phonemes.png)

## Changelog

Version 0.4:

  - Full documentation for existing code
  - Added GML, PDF, and SVG export
  - All options can now be accessed from the command line

Version 0.3:

  - Allow joining transitions in single sub-paths
  - Allow exporting a DAFSA as a `networkx` graph
  - Preliminary documentation at [ReadTheDocs](https://dafsa.readthedocs.io)

Version 0.2.1:

  - Added support for segmented data

Version 0.2:

  - Added support for weighted edges and nodes
  - Added DOT export and Graphviz generation
  - Refined minimization method, which can be skipped if desired (resulting
    in a standard trie)
  - Added examples in the resources, also used for test data

Version 0.1:

  - First public release.

## Roadmap

Version 0.5:

  - Deal with all TODOs
  - Profile the code and make it faster and less resource-hungry, using
    multiple threads wherever possible, memoization, etc.
  - Add code from Daciuk's packages in an extra directory, along with
    notes on license

Version 1.0:

  - Publish in journal

After 1.0:

  - Preliminary generation of minimal regular expressions matching the
    contents of a DAFSA
  - Consider the addition of empty transitions (or depend on the user
    aligning those)
  - Work on options for nicer graphviz output (colors, widths, etc.)

## Author and citation

The library is developed by Tiago Tresoldi (tresoldi@shh.mpg.de).
The author was supported during development by the
[ERC Grant #715618](https://cordis.europa.eu/project/rcn/206320/factsheet/en)
for the project [CALC](http://calc.digling.org)
(Computer-Assisted Language Comparison: Reconciling Computational and Classical
Approaches in Historical Linguistics).

If you use `dafsa`, please cite it as:

> Tresoldi, Tiago (2019). DAFSA, a library for computing Deterministic Acyclic Finite State Automata. Version 0.4. Jena. Available at: <https://github.com/tresoldi/dafsa>

In BibTeX:

```bibtex
@misc{Tresoldi2019dafsa,
  author = {Tresoldi, Tiago},
  title = {DAFSA, a library for computing Deterministic Acyclic Finite State Automata. Version 0.4},
  howpublished = {\url{https://github.com/tresoldi/dafsa}},
  address = {Jena},
  year = {2019},
}
```


