Metadata-Version: 2.1
Name: calista
Version: 0.3.3
Summary: Comprehensive Python package designed to simplify data quality checks across multiple platforms
License: UNKNOWN
Project-URL: Repository, https://github.com/Aubay-Data-AI/calista
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9.5
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.2.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: pydantic>=2.8.0
Requires-Dist: overrides>=7.7.0
Requires-Dist: schema>=0.7.5
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: polars<1.7,>=0.20.21
Provides-Extra: bigquery
Requires-Dist: sqlalchemy>=2.0.29; extra == "bigquery"
Requires-Dist: sqlalchemy-bigquery>=1.11.0; extra == "bigquery"
Provides-Extra: snowflake
Requires-Dist: snowflake-snowpark-python>=1.12.1; extra == "snowflake"
Provides-Extra: spark
Requires-Dist: pyspark>=3.5.0; extra == "spark"
Requires-Dist: delta-spark>=3.1.0; extra == "spark"

![Python](https://img.shields.io/badge/python-3.10-blue.svg)
[![Tests](https://github.com/Aubay-Data-AI/calista/actions/workflows/tests.yml/badge.svg)](https://github.com/Aubay-Data-AI/calista/actions)
![License](https://img.shields.io/badge/License-Apache-blue.svg)
![Black](https://img.shields.io/badge/code_style-Black-black.svg)
[![Aubay](https://img.shields.io/badge/aubay-8A2BE2)](https://data.aubay.com/)

<div style="text-align:center;">
    <img src="ressources/calista_logo.png" alt="Logo calista" />
</div>

Table of contents
- [Calista](#calista)
  - [Installing from PyPI](#installing-from-pypi)
  - [Getting Started](#getting-started)
    - [Example](#example)
  - [Documentation](#documentation)
  - [License](#license)


# Calista
__Calista__ is a comprehensive Python package designed to simplify data quality checks across multiple platforms using a consistent syntax. Inspired by modular libraries, __Calista__ aims to streamline data quality tasks by providing a unified interface.

Built on popular Python libraries like PySpark and SQLAlchemy, __Calista__ leverages their capabilities for efficient large-scale data processing. By abstracting engine-specific complexities, __Calista__ allows users to focus on data quality without dealing with implementation details.

At its core, __Calista__ offers a cohesive set of classes and methods that consolidate functionalities from various engine-specific modules. Users can seamlessly execute operations typically associated with Spark or SQL engines through intuitive __Calista__ interfaces.

Currently developed in Python 3.10, __Calista__ supports data quality checks using engines such as Spark, Pandas, Polars, Snowflake and BigQuery.

Whether orchestrating data pipelines or conducting assessments, __Calista__ provides the tools needed to navigate complex data quality checks with ease and efficiency.

## Installing from PyPI

To use our framework, simply install it via pip. This command installs the framework along with the default engines, pandas and polars:

```bash
pip install calista
```
If you require support for other engines such as Snowflake, Spark, or BigQuery, use the following command, replacing _EngineName_ with the name of your desired engine:

```bash
pip install calista[EngineName]
```
## Getting Started

To start using __Calista__, import the appropriate class:

```python
from calista import CalistaEngine
```

With __Calista__, you can easily analyze your data quality, regardless of the underlying engine. The unified API streamlines your workflow and enables seamless integration across different environments.


### Example

Here's an example using the pandas engine. Suppose you have a dataset represented as a table:

| ID | status      | last increase | salary |
|----|-------------|---------------|--------|
| 0  | Célibataire | 2022-12-31    | 36000  |
| 1  |             | 2023-12-31    | 53000  |
| 2  | Marié       | 2018-12-31    | 28000  |
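
For reference, the same sample data can be sketched as a plain pandas DataFrame (an illustration only; the blank `status` in row 1 is treated as a missing value, and the column names are taken from the table above):

```python
import pandas as pd

# Sample data mirroring the table above; None marks the missing status.
df = pd.DataFrame(
    {
        "ID": [0, 1, 2],
        "status": ["Célibataire", None, "Marié"],
        "last increase": ["2022-12-31", "2023-12-31", "2018-12-31"],
        "salary": [36000, 53000, 28000],
    }
)

print(df.shape)                   # (3, 4)
print(df["status"].isna().sum())  # 1 missing status
```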

You can load this table using `CalistaEngine` with the pandas engine:
```python
from calista import CalistaEngine

table = CalistaEngine(engine="pandas").load(path="examples/demo_new_model.csv", file_format="csv")
```

You can define custom rules using __Calista__ functions to analyze specific conditions within your data:
```python
from calista import functions as F

my_rule = F.is_not_null(col_name="status") & F.is_integer("salary")

print(table.analyze(rule_name="demo_new_model", rule=my_rule))
```

The output of the analysis provides insights into data quality based on the defined rule:
```
rule_name : demo_new_model
total_row_count : 3
valid_row_count : 2
valid_row_count_pct : 66.66
timestamp : 2024-04-23 10:00:59.449193
```
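
The percentage is simply valid rows over total rows: rows 0 and 2 satisfy both conditions, giving 2/3. A minimal sketch of that arithmetic in plain Python (an illustration, not Calista's implementation; the `66.66` above suggests the percentage is truncated rather than rounded):

```python
import math

# Each row from the sample table; None marks the missing status.
rows = [
    {"status": "Célibataire", "salary": 36000},
    {"status": None, "salary": 53000},
    {"status": "Marié", "salary": 28000},
]

# The rule: status is not null AND salary is an integer.
def passes(row):
    return row["status"] is not None and isinstance(row["salary"], int)

total = len(rows)
valid = sum(passes(r) for r in rows)
pct = math.floor(valid / total * 10000) / 100  # truncate to two decimals

print(valid, total, pct)  # 2 3 66.66
```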

You can also enrich your data by applying the rule, which appends a boolean column:
```python
from calista import functions as F

my_rule = F.is_not_null(col_name="status") & F.is_integer("salary")

print(table.apply_rule(rule_name="demo_new_model", rule=my_rule))
```

When printing, you'll get the following result:

| ID | status      | last increase | salary | demo_new_model |
|----|-------------|---------------|--------|----------------|
| 0  | Célibataire | 2022-12-31    | 36000  | True           |
| 1  |             | 2023-12-31    | 53000  | False          |
| 2  | Marié       | 2018-12-31    | 28000  | True           |

You can also retrieve only the rows that validate or invalidate the rule. For example, to get the rows invalidating the rule:
```python
from calista import functions as F

my_rule = F.is_not_null(col_name="status") & F.is_integer("salary")

print(table.get_invalid_rows(rule=my_rule))
```

When printing, you'll get the following result:

| ID | status | last increase | salary | demo_new_model |
|----|--------|---------------|--------|----------------|
| 1  |        | 2023-12-31    | 53000  | False          |
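
Conceptually, retrieving invalid rows is an ordinary filter: keep only the rows where the rule evaluates to False. A plain-Python sketch of that idea (an illustration, not Calista's actual implementation):

```python
# The sample table as a list of dicts; None marks the missing status.
rows = [
    {"ID": 0, "status": "Célibataire", "salary": 36000},
    {"ID": 1, "status": None, "salary": 53000},
    {"ID": 2, "status": "Marié", "salary": 28000},
]

def passes(row):
    # status must be present and salary must be an integer
    return row["status"] is not None and isinstance(row["salary"], int)

invalid = [r for r in rows if not passes(r)]
print(invalid)  # only the row with the missing status (ID 1)
```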

## Documentation

[Calista docs](https://calista.readthedocs.io/en/latest/)

## License

Licensed under the Apache License.


