Metadata-Version: 2.1
Name: dingo-python
Version: 1.0.4
Summary: Language quality evaluation tool.
Home-page: https://github.com/shijinpjlab/Dingo/main
Author: SH AI Lab
Author-email: shailab@pjlab.org.cn
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: hanziconv
Requires-Dist: jieba
Requires-Dist: jsonlines
Requires-Dist: langid
Requires-Dist: numpy ==1.26.4
Requires-Dist: textstat
Requires-Dist: zhon
Requires-Dist: transformers
Requires-Dist: toml
Requires-Dist: pydantic
Requires-Dist: requests
Requires-Dist: Pillow ==9.4.0
Requires-Dist: opencv-python
Requires-Dist: nltk
Requires-Dist: datasets
Requires-Dist: chardet
Requires-Dist: packaging
Requires-Dist: pandas
Requires-Dist: prettytable
Requires-Dist: fasttext-wheel ==0.9.2
Requires-Dist: wordninja ==2.0.0
Requires-Dist: huggingface-hub
Requires-Dist: fastapi
Requires-Dist: uvicorn

English | [简体中文](README_ZH.md)

# Introduction

Dingo is a data quality assessment tool that automatically detects quality issues in your datasets. It provides a variety of built-in detection rules and model-based methods, and also supports custom detection methods. It handles commonly used NLP and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. In addition, Dingo offers several interfaces, including a local CLI, an SDK, and a RESTful API, making it easy to integrate into evaluation platforms such as [OpenCompass](https://github.com/open-compass/opencompass) and [simple-evals](https://github.com/openai/simple-evals).

## Architecture of Dingo

![Architecture of Dingo](./docs/assets/architeture.png)

# QuickStart

Install `dingo`.
```shell
pip install dingo-python
```

Try the following `SDK` demo code:
```python
from dingo.model import Model
from dingo.io import InputArgs
from dingo.exec import Executor

input_data = {
    "eval_models": ["sft"],            # the quality-evaluation model to apply
    "input_path": "tatsu-lab/alpaca",  # downloaded from Hugging Face by default
    "data_format": "plaintext",
}

input_args = InputArgs(**input_data)
Model.apply_config(input_args.custom_config_path)  # load any custom config
executor = Executor.exec_map["local"](input_args)  # run in local mode
result = executor.evaluate()
print(result)
```
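The same SDK call also works for local files. A hedged variant is sketched below, assuming `InputArgs` accepts the same fields as the CLI flags listed later (`output_path`, `column_content` are assumed field names); the paths and column name are illustrative:
```python
from dingo.model import Model
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a local JSON file instead of a Hugging Face dataset.
input_args = InputArgs(
    eval_models=["default"],
    input_path="test/data/test_local_json.json",  # illustrative path
    data_format="json",
    output_path="test/outputs/",   # where results are written (assumed field)
    column_content="content",      # map your dataset's content key (assumed field)
)
Model.apply_config(input_args.custom_config_path)
executor = Executor.exec_map["local"](input_args)
result = executor.evaluate()
print(result)
```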

You can also try the `CLI`:
```shell
python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext
```

# Tutorials

## [Config](docs/config.md)

## Execute

`Dingo` can be run locally or on a Spark cluster.

### Local Mode

In addition to the SDK calls shown above, you can also run data evaluation locally with the `CLI`:

```shell
python -m dingo.run.cli
```

The CLI parameters are as follows.

| Parameter                   | Description                                                                   |
|-----------------------------|-------------------------------------------------------------------------------|
| `-e` or `--eval_models`     | The model used to evaluate data quality.                                      |
| `-i` or `--input_path`      | The path of the input data. It can be a file or a directory.                  |
| `--output_path`             | The path where result data is written.                                        |
| `--data_format`             | The format of the data. It can be json, jsonl, plaintext, or list json.       |
| `--dataset`                 | The platform the evaluation runs on. It can be huggingface, local, or spark.  |
| `--datasource`              | The source of the data. It can be huggingface, local, or s3.                  |
| `--huggingface_split`       | The split of the Hugging Face dataset.                                        |
| `--column_id`               | The name of the id column in the data.                                        |
| `--column_prompt`           | The name of the prompt column in the data.                                    |
| `--column_content`          | The name of the content column in the data.                                   |
| `--custom_config_path`      | The path of the custom config file.                                           |
| `--spark_master_url`        | The URL of the Spark master.                                                  |
| `--spark_summary_save_path` | The path where the summary is saved when running on Spark.                    |
| `--s3_ak`                   | The S3 access key (AK).                                                       |
| `--s3_sk`                   | The S3 secret key (SK).                                                       |
| `--s3_endpoint_url`         | The S3 endpoint URL.                                                          |
| `--s3_addressing_style`     | The S3 addressing style.                                                      |
| `--s3_bucket`               | The S3 bucket name.                                                           |

More information can be obtained by running the following command: `python -m dingo.run.cli --help`.
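For example, a run over a local JSON file that writes results to a chosen directory might look like this (the paths and column name are illustrative):
```shell
python -m dingo.run.cli \
  -e default \
  --input_path test/data/test_local_json.json \
  --data_format json \
  --output_path test/outputs/ \
  --column_content content
```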

### Spark Mode

If your dataset is very large, you can use Spark to run the evaluation.

First, create a `SparkExecutor` object and give it your actual `SparkSession` and `DataFrame` instances.

```python
from dingo.exec.spark import SparkExecutor

spark_exec = SparkExecutor()
spark_exec.set_spark(spark_session)        # an existing SparkSession
spark_exec.set_input_df(spark_data_frame)  # an existing input DataFrame
```
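The snippet above assumes an existing `spark_session` and `spark_data_frame`. For local testing, a minimal sketch of how to build both, assuming `pyspark` is installed (the sample row and app name are illustrative):

```python
from pyspark.sql import SparkSession

# Build a local SparkSession; on a cluster, point master() at your Spark master URL.
spark_session = SparkSession.builder.master("local[*]").appName("dingo-eval").getOrCreate()

# An illustrative input DataFrame; real data would be loaded from a table or file.
spark_data_frame = spark_session.createDataFrame(
    [("0", "", "I am 8 years old. I love apples because:")],
    schema=["data_id", "prompt", "content"],
)
```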

Then, convert the data and execute the rule list.

```python
spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
```

Finally, summarize and get the result data.

```python
spark_exec.summarize()
output_df = spark_exec.get_output_df()
summary = spark_exec.get_summary()
```


## Evaluation Results

### Summary

The `summary.json` file contains overall information about the evaluation results. Here is an example:

```json
{
    "dataset_id": "20240816_175052",
    "input_model": "default",
    "input_path": "test/data/test_local_json.json",
    "output_path": "test/outputs/20240816_175052",
    "score": 0.0,
    "num_good": 0,
    "num_bad": 2,
    "total": 2,
    "error_type_ratio": {
        "QUALITY_INEFFECTIVENESS": 0.0,
        "QUALITY_INCOMPLETENESS": 0.0,
        "QUALITY_DISUNDERSTANDABILITY": 0.0,
        "QUALITY_DISSIMILARITY": 0.0,
        "QUALITY_DISFLUENCY": 0.0,
        "QUALITY_IRRELEVANCE": 1.0,
        "QUALITY_INSECURITY": 0.0
    },
    "error_name_ratio": {
        "QUALITY_IRRELEVANCE-CommonSpecialCharacter": 1.0
    }
}
```

The `error_type_ratio` field shows data quality signals across seven aspects:
`EFFECTIVENESS`, `COMPLETENESS`, `UNDERSTANDABILITY`, `SIMILARITY`, `FLUENCY`, `RELEVANCE`, and `SECURITY`.
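
Because the summary is plain JSON, it is easy to post-process. A minimal sketch that loads the file shown above and prints the quality signals that actually fired (the output path is illustrative):

```python
import json

# Load the summary produced by a run; the path is illustrative.
with open("test/outputs/20240816_175052/summary.json") as f:
    summary = json.load(f)

print(f"score: {summary['score']} ({summary['num_bad']}/{summary['total']} items flagged)")
# Print only the quality signals with a non-zero ratio.
for error_type, ratio in summary["error_type_ratio"].items():
    if ratio > 0:
        print(error_type, ratio)
```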

### Detailed Results

For the detailed issues found in individual data items, `Dingo` creates files in directories named after the quality signals above.
For example, `CommonSpecialCharacter.json` in the `QUALITY_IRRELEVANCE` directory looks like this:

```jsonl
{"data_id": "0", "prompt": "", "content": "�I am 8 years old. ^I love apple because: fuck you", "error_type": ["QUALITY_IRRELEVANCE"], "error_name": ["QUALITY_IRRELEVANCE-CommonSpecialCharacter"], "error_reason": ["�"]}
{"data_id": "1", "prompt": "", "content": "�[I like blue best. Because blue is the color of the sky. ", "error_type": ["QUALITY_IRRELEVANCE"], "error_name": ["QUALITY_IRRELEVANCE-CommonSpecialCharacter"], "error_reason": ["�"]}

```
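Each detailed file is in JSON Lines format, so you can stream it with the `jsonlines` package (already a Dingo dependency). A small sketch, with an illustrative path that follows the layout described above:

```python
import jsonlines

# Iterate over the flagged items in one detailed result file.
with jsonlines.open("test/outputs/20240816_175052/QUALITY_IRRELEVANCE/CommonSpecialCharacter.json") as reader:
    for item in reader:
        print(item["data_id"], item["error_name"], item["error_reason"])
```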

We evaluated the quality of the following three datasets with `Dingo`.

| Dataset         | Dataset Type | EFFECTIVENESS | COMPLETENESS | UNDERSTANDABILITY | SIMILARITY | FLUENCY  | RELEVANCE | SECURITY |
|-----------------|--------------|---------------|--------------|-------------------|------------|----------|-----------|----------|
| SlimPajama-627B | Pretrain     | 0.016860      | 0.000175     | 0.002062          | 0.003563   | 0.000302 | 0.003767  | 0        |
| Stanford_alpaca | SFT          | 0.001442      | 0.000538     | 0.000481          | 0.000231   | 0        | 0         | 0        |
| MMLU            | Benchmark    | 0.011759      | 0.007349     | 0                 | 0          | 0        | 0         | 0        |

# [Rule List](docs/rule_list.md)



# Contributing
We appreciate all contributions to `Dingo`. Please refer to [CONTRIBUTING.md](docs/en/CONTRIBUTING.md) for the contribution guidelines.

# License
This project is released under the [Apache 2.0 license](LICENSE).

