Metadata-Version: 2.1
Name: Spark-Matcher
Version: 0.1
Summary: Record matching and entity resolution at scale in Spark
Author: Ahmet Bayraktar, Stan Leisink, Frits Hermans
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: python-Levenshtein
Requires-Dist: thefuzz
Requires-Dist: modAL
Requires-Dist: pytest
Requires-Dist: multipledispatch
Requires-Dist: dill
Requires-Dist: graphframes
Requires-Dist: scipy
Provides-Extra: base
Requires-Dist: pandas ; extra == 'base'
Requires-Dist: numpy ; extra == 'base'
Requires-Dist: scikit-learn ; extra == 'base'
Requires-Dist: python-Levenshtein ; extra == 'base'
Requires-Dist: thefuzz ; extra == 'base'
Requires-Dist: modAL ; extra == 'base'
Requires-Dist: pytest ; extra == 'base'
Requires-Dist: multipledispatch ; extra == 'base'
Requires-Dist: dill ; extra == 'base'
Requires-Dist: graphframes ; extra == 'base'
Requires-Dist: scipy ; extra == 'base'
Provides-Extra: dev
Requires-Dist: pandas ; extra == 'dev'
Requires-Dist: numpy ; extra == 'dev'
Requires-Dist: scikit-learn ; extra == 'dev'
Requires-Dist: python-Levenshtein ; extra == 'dev'
Requires-Dist: thefuzz ; extra == 'dev'
Requires-Dist: modAL ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: multipledispatch ; extra == 'dev'
Requires-Dist: dill ; extra == 'dev'
Requires-Dist: graphframes ; extra == 'dev'
Requires-Dist: scipy ; extra == 'dev'
Requires-Dist: sphinx ; extra == 'dev'
Requires-Dist: nbsphinx ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme ; extra == 'dev'
Requires-Dist: pyspark ; extra == 'dev'
Requires-Dist: pyarrow ; extra == 'dev'
Requires-Dist: jupyterlab ; extra == 'dev'
Provides-Extra: doc
Requires-Dist: pandas ; extra == 'doc'
Requires-Dist: numpy ; extra == 'doc'
Requires-Dist: scikit-learn ; extra == 'doc'
Requires-Dist: python-Levenshtein ; extra == 'doc'
Requires-Dist: thefuzz ; extra == 'doc'
Requires-Dist: modAL ; extra == 'doc'
Requires-Dist: pytest ; extra == 'doc'
Requires-Dist: multipledispatch ; extra == 'doc'
Requires-Dist: dill ; extra == 'doc'
Requires-Dist: graphframes ; extra == 'doc'
Requires-Dist: scipy ; extra == 'doc'
Requires-Dist: sphinx ; extra == 'doc'
Requires-Dist: nbsphinx ; extra == 'doc'
Requires-Dist: sphinx-rtd-theme ; extra == 'doc'

![spark_matcher_logo](docs/source/_static/spark_matcher_logo.png)

# Spark-Matcher

Spark-Matcher is a scalable entity matching algorithm implemented in PySpark. With Spark-Matcher the user can easily
train an algorithm to solve a custom matching problem. Spark-Matcher uses active learning (modAL) to train a
classifier (scikit-learn) to match entities. To deal with the N^2 complexity of comparing every record with every
other record in large tables, blocking is implemented to reduce the number of candidate pairs. Because the
implementation is done in PySpark, Spark-Matcher can deal with extremely large tables.
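
To illustrate why blocking helps, here is a minimal plain-Python sketch. It is not Spark-Matcher's implementation: the blocking rule (first letter of the name) and the names `block_key` and `candidate_pairs` are assumptions made for this example only.

```python
from collections import defaultdict


def block_key(record):
    # Hypothetical blocking rule: first character of the name, lowercased.
    return record["name"][:1].lower()


def candidate_pairs(table_a, table_b, key=block_key):
    """Yield only the pairs whose records fall in the same block."""
    blocks_b = defaultdict(list)
    for rec_b in table_b:
        blocks_b[key(rec_b)].append(rec_b)
    for rec_a in table_a:
        for rec_b in blocks_b.get(key(rec_a), []):
            yield rec_a, rec_b


table_a = [{"name": "Anna Smith"}, {"name": "Bob Jones"}, {"name": "Carl Young"}]
table_b = [{"name": "anna smyth"}, {"name": "Bobby Jones"}, {"name": "Zed Miles"}]

pairs = list(candidate_pairs(table_a, table_b))
print(len(table_a) * len(table_b), "pairs without blocking")  # 9
print(len(pairs), "pairs with blocking")  # 2
```

Only records that share a block are compared, so the number of pairs handed to the classifier grows with the block sizes rather than with the product of the table sizes. The trade-off is that true matches that land in different blocks are never compared.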

Developed by data scientists at ING Analytics, www.ing.com.

## Installation

### Normal installation

Install after cloning this repo:

```
pip install .
```

### Install with the option to build documentation

Add `[doc]` like this:

```
pip install ".[doc]"
```

### Install to contribute

Clone this repo and install in editable mode. This also installs PySpark and JupyterLab:

```
python -m pip install -e ".[dev]"
python setup.py develop
```

## Documentation

Documentation can be created using the following command:

```
make create_documentation
```

## Dependencies

The usage examples in the `examples` directory contain notebooks that run in local mode.
Using Spark-Matcher in cluster mode requires sending the Spark-Matcher package and several other Python packages (see `spark_requirements.txt`) to the executors.
How to send these dependencies depends on the cluster.
Please read the Apache Spark instructions and examples on how to do this: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html.
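
As one possible approach for pure-Python code, a package directory can be zipped and handed to the executors with `SparkContext.addPyFile`. The sketch below is an assumption about a typical setup, not a procedure prescribed by this project; `zip_package` and `ship_package` are hypothetical helper names. Packages with compiled extensions (for example numpy or scikit-learn) generally cannot be shipped this way and need one of the environment-packaging options described in the Spark guide linked above.

```python
import shutil
from pathlib import Path


def zip_package(package_dir, out_dir):
    """Zip a package directory so that the package folder sits at the archive root."""
    package_dir = Path(package_dir)
    return shutil.make_archive(
        base_name=str(Path(out_dir) / package_dir.name),
        format="zip",
        root_dir=str(package_dir.parent),  # archive paths are relative to this directory
        base_dir=package_dir.name,         # only this folder is added to the archive
    )


def ship_package(spark, archive_path):
    """Make the zipped package importable on the executors of an existing SparkSession."""
    spark.sparkContext.addPyFile(archive_path)
```

For example, `ship_package(spark, zip_package("spark_matcher", "/tmp"))` would be called once after the Spark session is created, where both paths are placeholders for your own layout.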

Spark-Matcher uses `graphframes` under the hood.
Therefore, depending on the Spark version, the correct version of `graphframes` needs to be added to the `external_dependencies` directory and to the configuration of the Spark session.
By default, `graphframes` for Spark 3.0 is used in the Spark sessions in the notebooks in the `examples` directory.
For a different version, see: https://spark-packages.org/package/graphframes/graphframes.
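
A minimal sketch of wiring a local `graphframes` jar into the session configuration could look as follows. The helper names `graphframes_conf` and `build_session`, the file pattern and the application name are assumptions for illustration; the example notebooks may configure the session differently. `spark.jars` is the standard Spark property for a comma-separated list of jars to add to the driver and executor classpaths.

```python
from pathlib import Path


def graphframes_conf(jar_dir="external_dependencies", pattern="graphframes*.jar"):
    """Return Spark config entries that put the graphframes jar(s) on the classpath."""
    jars = sorted(str(p.resolve()) for p in Path(jar_dir).glob(pattern))
    if not jars:
        raise FileNotFoundError(f"no file matching {pattern!r} in {jar_dir!r}")
    return {"spark.jars": ",".join(jars)}


def build_session(conf, app_name="spark-matcher-example"):
    """Create a SparkSession with the given config entries (requires pyspark)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(graphframes_conf())` from the repository root would then start a session that can load `graphframes`, provided the jar matches the installed Spark version.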

## Usage

Example notebooks are provided in the `examples` directory.
To find matches between Spark dataframes `a` and `b`, first create a `Matcher` object:

```python
from spark_matcher.matcher import Matcher

# `spark_session` is an existing SparkSession; `col_names` lists the columns used for matching
myMatcher = Matcher(spark_session, col_names=['name', 'suburb', 'postcode'])
```

Now we are ready to fit the Matcher object using active learning: the user is asked whether a given pair is a match
or not. Enter 'y' if a pair is a match or 'n' if it is not. You will be notified when the model has converged, and
you can stop training by pressing 'f'.

```python
myMatcher.fit(a, b)
```
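
To give an intuition for the labelling loop, the following plain-Python sketch shows uncertainty sampling with a one-dimensional threshold classifier: it always asks about the pair whose similarity score lies closest to the current decision threshold. This is a simplified illustration, not the code that `fit` runs; the `difflib` similarity, the helper names and the simulated `oracle` are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Hypothetical similarity score in [0, 1] based on difflib.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_uncertain(scores, labelled, threshold):
    """Index of the unlabelled pair whose score is closest to the threshold."""
    unlabelled = [i for i in range(len(scores)) if i not in labelled]
    return min(unlabelled, key=lambda i: abs(scores[i] - threshold))


def update_threshold(scores, labelled):
    """Midpoint between the highest labelled non-match and the lowest labelled match."""
    matches = [scores[i] for i, is_match in labelled.items() if is_match]
    non_matches = [scores[i] for i, is_match in labelled.items() if not is_match]
    return (max(non_matches, default=0.0) + min(matches, default=1.0)) / 2


def active_learn(pairs, oracle, n_queries=3):
    """Query the oracle about the most uncertain pairs and refit the threshold."""
    scores = [similarity(a, b) for a, b in pairs]
    labelled, threshold = {}, 0.5
    for _ in range(min(n_queries, len(pairs))):
        i = most_uncertain(scores, labelled, threshold)
        labelled[i] = oracle(pairs[i])  # in practice: the user's 'y' / 'n' answer
        threshold = update_threshold(scores, labelled)
    return threshold, labelled


pairs = [
    ("Anna Smith", "anna smyth"),
    ("Bob Jones", "Bobby Jones"),
    ("Carl Young", "Zed Miles"),
    ("Dana White", "Dana Whyte"),
]
true_matches = {pairs[0], pairs[1], pairs[3]}  # simulated answers for the example
threshold, labelled = active_learn(pairs, oracle=lambda pair: pair in true_matches)
print(f"learned threshold: {threshold:.2f} after {len(labelled)} labels")
```

Asking about the most ambiguous pairs first is what lets an active learner converge with far fewer labels than labelling pairs at random.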

The Matcher is now trained and can be used to predict on all data. This can be the data used for training or new data
that was not seen by the model yet.

```python
result = myMatcher.predict(a, b)
```


