Metadata-Version: 2.1
Name: rdt
Version: 0.2.4.dev0
Summary: Reversible Data Transformsi
Home-page: https://github.com/sdv-dev/RDT
Author: MIT Data To AI Lab
Author-email: dailabmit@gmail.com
License: MIT license
Keywords: rdt
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.5,<3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy (<2,>=1.15.4)
Requires-Dist: pandas (<1.1,>=0.21)
Requires-Dist: scipy (<2,>=1.1.0)
Requires-Dist: Faker (<2,>=1.0.1)
Provides-Extra: dev
Requires-Dist: bumpversion (<0.6,>=0.5.3) ; extra == 'dev'
Requires-Dist: pip (>=9.0.1) ; extra == 'dev'
Requires-Dist: watchdog (<0.11,>=0.8.3) ; extra == 'dev'
Requires-Dist: m2r (<0.3,>=0.2.0) ; extra == 'dev'
Requires-Dist: Sphinx (<3,>=1.7.1) ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme (<0.5,>=0.2.4) ; extra == 'dev'
Requires-Dist: autodocsumm (>=0.1.10) ; extra == 'dev'
Requires-Dist: flake8 (<4,>=3.7.7) ; extra == 'dev'
Requires-Dist: flake8-absolute-import (<2,>=1.0) ; extra == 'dev'
Requires-Dist: flake8-docstrings (<2,>=1.5.0) ; extra == 'dev'
Requires-Dist: flake8-sfs (<0.1,>=0.0.3) ; extra == 'dev'
Requires-Dist: isort (<5,>=4.3.4) ; extra == 'dev'
Requires-Dist: pylint (<3,>=2.5.3) ; extra == 'dev'
Requires-Dist: autoflake (<2,>=1.1) ; extra == 'dev'
Requires-Dist: autopep8 (<2,>=1.4.3) ; extra == 'dev'
Requires-Dist: twine (<4,>=1.10.0) ; extra == 'dev'
Requires-Dist: wheel (>=0.30.0) ; extra == 'dev'
Requires-Dist: coverage (<6,>=4.5.1) ; extra == 'dev'
Requires-Dist: tox (<4,>=2.9.1) ; extra == 'dev'
Requires-Dist: pytest (>=3.4.2) ; extra == 'dev'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'dev'
Requires-Dist: jupyter (<2,>=1.0.0) ; extra == 'dev'
Requires-Dist: rundoc (<0.5,>=0.4.3) ; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest (>=3.4.2) ; extra == 'test'
Requires-Dist: pytest-cov (>=2.6.0) ; extra == 'test'
Requires-Dist: jupyter (<2,>=1.0.0) ; extra == 'test'
Requires-Dist: rundoc (<0.5,>=0.4.3) ; extra == 'test'

<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI-Lab" />
<i>An open source project from Data to AI Lab at MIT.</i>
</p>

[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
[![PyPi Shield](https://img.shields.io/pypi/v/RDT.svg)](https://pypi.python.org/pypi/RDT)
[![Travis CI Shield](https://travis-ci.org/sdv-dev/RDT.svg?branch=master)](https://travis-ci.org/sdv-dev/RDT)
[![Coverage Status](https://codecov.io/gh/sdv-dev/RDT/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/RDT)
[![Downloads](https://pepy.tech/badge/rdt)](https://pepy.tech/project/rdt)

# RDT: Reversible Data Transforms

* License: [MIT](https://github.com/sdv-dev/RDT/blob/master/LICENSE)
* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
* Documentation: https://sdv-dev.github.io/RDT
* Homepage: https://github.com/sdv-dev/RDT

## Overview

**RDT** is a Python library used to transform data for data science libraries and preserve
the transformations in order to revert them as needed.

# Install

## Requirements

**RDT** has been developed and tested on [Python 3.5, 3.6, 3.7 and 3.8](https://www.python.org/downloads/)

Also, although it is not strictly required, the usage of a [virtualenv](
https://virtualenv.pypa.io/en/latest/) is highly recommended in order to avoid
interfering with other software installed in the system where **RDT** is run.

## Install with pip

The easiest and recommended way to install **RDT** is using [pip](
https://pip.pypa.io/en/stable/):

```bash
pip install rdt
```

This will pull and install the latest stable release from [PyPi](https://pypi.org/).

If you want to install from source or contribute to the project please read the
[Contributing Guide](https://sdv-dev.github.io/RDT/contributing.html#get-started).


# Quickstart

In this short series of tutorials we will guide you through a series of steps that will
help you getting started using **RDT** to transform columns, tables and datasets.

## Transforming a column

In this first guide, you will learn how to use **RDT** in its simplest form, transforming
a single column loaded as a `pandas.DataFrame` object.

### 1. Load the demo data

You can load some demo data using the `rdt.get_demo` function, which will return some random
data for you to play with.

```python3
from rdt import get_demo

data = get_demo()
```

This will return a `pandas.DataFrame` with 10 rows and 4 columns, one of each data type supported:

```
   0_int    1_float 2_str          3_datetime
0   38.0  46.872441     b 2021-02-10 21:50:00
1   77.0  13.150228   NaN 2021-07-19 21:14:00
2   21.0        NaN     b                 NaT
3   10.0  37.128869     c 2019-10-15 21:39:00
4   91.0  41.341214     a 2020-10-31 11:57:00
5   67.0  92.237335     a                 NaT
6    NaN  51.598682   NaN 2020-04-01 01:56:00
7    NaN  42.204396     c 2020-03-12 22:12:00
8   68.0        NaN     c 2021-02-25 16:04:00
9    7.0  31.542918     a 2020-07-12 03:12:00
```

Notice how the data is random, so your output might look a bit different. Also notice how
RDT introduced some null values randomly.

### 2. Load the transformer

In this example we will use the datetime column, so let's load a `DatetimeTransformer`.

```python3
from rdt.transformers import DatetimeTransformer

transformer = DatetimeTransformer()
```

### 3. Fit the Transformer

Before being able to transform the data, we need the transformer to learn from it.

We will do this by calling its `fit` method passing the column that we want to transform.

```python3
transformer.fit(data['3_datetime'])
```

### 4. Transform the data

Once the transformer is fitted, we can pass the data again to its `transform` method in order
to get the transformed version of the data.

```python3
transformed = transformer.transform(data['3_datetime'])
```

The output will be a `numpy.ndarray` with two columns, one with the datetimes transformed
to integer timestamps, and another one indicating with 1s which values were null in the
original data.

```
array([[1.61299380e+18, 0.00000000e+00],
       [1.62672924e+18, 0.00000000e+00],
       [1.59919923e+18, 1.00000000e+00],
       [1.57117554e+18, 0.00000000e+00],
       [1.60414542e+18, 0.00000000e+00],
       [1.59919923e+18, 1.00000000e+00],
       [1.58570616e+18, 0.00000000e+00],
       [1.58405112e+18, 0.00000000e+00],
       [1.61426904e+18, 0.00000000e+00],
       [1.59452352e+18, 0.00000000e+00]])
```

### 5. Revert the column transformation

In order to revert the previous transformation, the transformed data can be passed to
the `reverse_transform` method of the transformer:

```python3
reversed_data = transformer.reverse_transform(transformed)
```

The output will be a `pandas.Series` containing the reverted values, which should be exactly
like the original ones.

```
0   2021-02-10 21:50:00
1   2021-07-19 21:14:00
2                   NaT
3   2019-10-15 21:39:00
4   2020-10-31 11:57:00
5                   NaT
6   2020-04-01 01:56:00
7   2020-03-12 22:12:00
8   2021-02-25 16:04:00
9   2020-07-12 03:12:00
dtype: datetime64[ns]
```

## Transforming a table

Once we know how to transform a single column, we can try to go the next level and transform
a table with multiple columns.

### 1. Load the HyperTransformer

In order to manuipulate a complete table we will need to load a `rdt.HyperTransformer`.

```python3
from rdt import HyperTransformer

ht = HyperTransformer()
```

### 2. Fit the HyperTransformer

Just like the transfomer, the HyperTransformer needs to be fitted before being able to transform
data.

This is done by calling its `fit` method passing the `data` DataFrame.

```python3
ht.fit(data)
```

### 3. Transform the table data

Once the HyperTransformer is fitted, we can pass the data again to its `transform` method in order
to get the transformed version of the data.

```python3
transformed = ht.transform(data)
```

The output, will now be another `pandas.DataFrame` with the numerical representation of our
data.

```
    0_int  0_int#1    1_float  1_float#1  2_str    3_datetime  3_datetime#1
0  38.000      0.0  46.872441        0.0   0.70  1.612994e+18           0.0
1  77.000      0.0  13.150228        0.0   0.90  1.626729e+18           0.0
2  21.000      0.0  44.509511        1.0   0.70  1.599199e+18           1.0
3  10.000      0.0  37.128869        0.0   0.15  1.571176e+18           0.0
4  91.000      0.0  41.341214        0.0   0.45  1.604145e+18           0.0
5  67.000      0.0  92.237335        0.0   0.45  1.599199e+18           1.0
6  47.375      1.0  51.598682        0.0   0.90  1.585706e+18           0.0
7  47.375      1.0  42.204396        0.0   0.15  1.584051e+18           0.0
8  68.000      0.0  44.509511        1.0   0.15  1.614269e+18           0.0
9   7.000      0.0  31.542918        0.0   0.45  1.594524e+18           0.0
```

### 4. Revert the table transformation

In order to revert the transformation and recover the original data from the transformed one,
we need to call `reverse_transform` method of the `HyperTransformer` instance passing it the
transformed data.

```python3
reversed_data = ht.reverse_transform(transformed)
```

Which should output, again, a table that looks exactly like the original one.

```
   0_int    1_float 2_str          3_datetime
0   38.0  46.872441     b 2021-02-10 21:50:00
1   77.0  13.150228   NaN 2021-07-19 21:14:00
2   21.0        NaN     b                 NaT
3   10.0  37.128869     c 2019-10-15 21:39:00
4   91.0  41.341214     a 2020-10-31 11:57:00
5   67.0  92.237335     a                 NaT
6    NaN  51.598682   NaN 2020-04-01 01:56:00
7    NaN  42.204396     c 2020-03-12 22:12:00
8   68.0        NaN     c 2021-02-25 16:04:00
9    7.0  31.542918     a 2020-07-12 03:12:00
```

# What's next?

For more details about **Reversible Data Transforms**, how to contribute to the project, and
its complete API reference, please visit the [documentation site](
https://sdv-dev.github.io/RDT/).


# History

## 0.2.3 - 2020-07-09

* Implement OneHot and Label encoding as transformers - Issue [#112](https://github.com/sdv-dev/RDT/issues/112) by @csala

## 0.2.2 - 2020-06-26

### Bugs Fixed

* Escape `column_name` in hypertransformer - Issue [#110](https://github.com/sdv-dev/RDT/issues/110) by @csala

## 0.2.1 - 2020-01-17

### Bugs Fixed

* Boolean Transformer fails to revert when there are NO nulls - Issue [#103](https://github.com/sdv-dev/RDT/issues/103) by @JDTheRipperPC

## 0.2.0 - 2019-10-15

This version comes with a brand new API and internal implementation, removing the old
metadata JSON from the user provided arguments, and making each transformer work only
with `pandas.Series` of their corresponding data type.

As part of this change, several transformer names have been changed and a new BooleanTransformer
and a feature to automatically decide which transformers to use based on dtypes have been added.

Unit test coverage has also been increased to 100%.

Special thanks to @JDTheRipperPC and @csala for the big efforts put in making this
release possible.

### Issues

* Drop the usage of meta - Issue [#72](https://github.com/sdv-dev/RDT/issues/72) by @JDTheRipperPC
* Make CatTransformer.probability_map deterministic - Issue [#25](https://github.com/sdv-dev/RDT/issues/25) by @csala

## 0.1.3 - 2019-09-24

### New Features

* Add attributes NullTransformer and col_meta - Issue [#30](https://github.com/sdv-dev/RDT/issues/30) by @ManuelAlvarezC

### General Improvements

* Integrate with CodeCov - Issue [#89](https://github.com/sdv-dev/RDT/issues/89) by @csala
* Remake Sphinx Documentation - Issue [#96](https://github.com/sdv-dev/RDT/issues/96) by @JDTheRipperPC
* Improve README - Issue [#92](https://github.com/sdv-dev/RDT/issues/92) by @JDTheRipperPC
* Document RELEASE workflow - Issue [#93](https://github.com/sdv-dev/RDT/issues/93) by @JDTheRipperPC
* Add support to Python 3.7 - Issue [#38](https://github.com/sdv-dev/RDT/issues/38) by @ManuelAlvarezC
* Create way to pass HyperTransformer table dict - Issue [#45](https://github.com/sdv-dev/RDT/issues/45) by @ManuelAlvarezC

## 0.1.2

* Add a numerical transformer for positive numbers.
* Add option to anonymize data on categorical transformer.
* Move the `col_meta` argument from method-level to class-level.
* Move the logic for missing values from the transformers into the `HyperTransformer`.
* Removed unreacheble lines in `NullTransformer`.
* `Numbertransfomer` to set default value to 0 when the column is null.
* Add a CLA for collaborators.
* Refactor performance-wise the transformers.

## 0.1.1

* Improve handling of NaN in NumberTransformer and CatTransformer.
* Add unittests for HyperTransformer.
* Remove unused methods `get_types` and `impute_table` from HyperTransformer.
* Make NumberTransformer enforce dtype int on integer data.
* Make DTTransformer check data format before transforming.
* Add minimal API Reference.
* Merge `rdt.utils` into `HyperTransformer` class. 

## 0.1.0

* First release on PyPI.


