Metadata-Version: 2.1
Name: russian-paraphrasers
Version: 0.0.1
Summary: Russian Paraphrasers (based on ru-gpt, mt5)
Home-page: https://github.com/RussianNLP/russian_paraphrasers
Author: Alenusch
Author-email: alenush93@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: rusenttokenize
Requires-Dist: transformers (>=4.0.1)

# Russian Paraphrasers

A library for Russian paraphrase generation.
Paraphrase generation is an increasingly popular NLP task with many applications:

- style transfer: 
    - translation from rude to polite
    - translation from professional to simple language
- data augmentation: increasing the number of training examples for ML models
- improving the robustness of ML models: training on a wide variety of examples that differ in style and sentiment but share the same meaning or user intent

## Install

```
pip install --upgrade pip
pip install -r requirements.txt
pip install russian_paraphrasers
```

Contents of `requirements.txt`:
```
sentence-transformers==0.4.0
transformers>=4.0.1
git+https://github.com/Maluuba/nlg-eval.git@master
```

## Usage

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IjBeV--kiBoPQM6bqg9h2cX4Vhf1ofNK?usp=sharing)

1) First, import one of the models and set general parameters for your paraphraser:

```
from russian_paraphrasers import GPTParaphraser

paraphraser = GPTParaphraser(model_name="gpt2", range_cand=False, make_eval=False)
```

```
from russian_paraphrasers import Mt5Paraphraser

paraphraser = Mt5Paraphraser(model_name="mt5-base", range_cand=False, make_eval=False)
```

You can choose whether to (1) filter the generated candidates and (2) compute evaluation metrics, either for the best candidates or for all `n` samples.

Arguments:
- model_name: `mt5-small`, `mt5-base`, `mt5-large`, `gpt2`
- range_cand: `True/False`
- make_eval: `True/False`

2) Pass a sentence (required) and the generation parameters to the `generate` function to get the results.

```
sentence = "Мама мыла раму."
results = paraphraser.generate(
    sentence, n=10, temperature=1, 
    top_k=10, top_p=0.9, 
    max_length=100, repetition_penalty=1.5
)
```
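
The `top_k` and `top_p` arguments in the call above share their names with the common sampling options of Hugging Face text generation. As a rough illustration of what such options usually mean, here is a minimal pure-Python sketch of top-k followed by top-p (nucleus) filtering over a toy distribution. It is not the library's internal code: the helper names `filter_top_k_top_p` and `sample_token` and the toy probabilities are hypothetical, and it is an assumption that the library forwards these parameters to the underlying model unchanged. Temperature and repetition penalty are left out for brevity.

```python
import random


def filter_top_k_top_p(probs, top_k=10, top_p=0.9):
    """Return the renormalized distribution left after top-k, then top-p filtering.

    `probs` maps each candidate token to its probability (hypothetical toy input).
    """
    # Rank tokens from most to least likely and keep only the top_k best.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, rng, top_k=10, top_p=0.9):
    """Draw one token from the filtered distribution using the given RNG."""
    filtered = filter_top_k_top_p(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution, invented for illustration only.
toy = {"раму": 0.5, "окно": 0.3, "пол": 0.15, "кота": 0.05}
filtered = filter_top_k_top_p(toy, top_k=3, top_p=0.7)
choice = sample_token(toy, random.Random(0), top_k=3, top_p=0.7)
```

Lower `top_k` or `top_p` values shrink the set of tokens that can be sampled, which tends to make the `n` generated candidates more conservative and more similar to each other.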

Results for one sentence look like this:

```
{'average_metrics': {'Bleu_1': 0.06666666665333353,
                     'Bleu_2': 2.3570227263379004e-09,
                     'Bleu_3': 8.514692649183842e-12,
                     'Bleu_4': 5.665278056606597e-13,
                     'ROUGE_L': 0.07558859975216851},
 'best_candidats': ['В чём цель существования человека?',
                    'Для чего нужна жизнь?',
                    'Что такое жизнь в смысле смысла ее существования, и зачем '
                    'она нужна человеку.'],
 'predictions': ['В чём счастье людей, проживающих в мире сегодня',
                 'В чём счастье человека?)',
                 'Для чего нужна жизнь и какова цель ее существования?',
                 'Что означает фраза в том чтобы жить жизнью?',
                 'В чём ценность человеческой Жизни?',
                 'В чём счастье людей в мире? и т. д.',
                 'Зачем нужна жизнь и что в ней главное докуменция дл',
                 'В чём цель существования человека?',
                 'Что такое жизнь в смысле смысла ее существования, и зачем '
                 'она нужна человеку.',
                 'Для чего нужна жизнь?']
}
```
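
For the data augmentation use case, a result dictionary shaped like the one above can be post-processed with a few lines of plain Python. The sketch below assumes only the `best_candidats` and `predictions` keys seen in the sample output (the key is spelled `best_candidats` there). The helpers `pick_paraphrases` and `augment`, the stand-in `fake_generate`, and the input sentence are hypothetical and are not part of the library; in practice you would pass a function that wraps `paraphraser.generate`.

```python
def pick_paraphrases(result, limit=None):
    """Return unique paraphrases from a result dict, best candidates first."""
    # Best candidates (if any) come first, followed by the raw predictions.
    ordered = list(result.get("best_candidats", [])) + list(result.get("predictions", []))
    seen, unique = set(), []
    for text in ordered:
        cleaned = text.strip()
        # Skip empty strings and exact duplicates while preserving order.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique if limit is None else unique[:limit]


def augment(examples, paraphrase_fn, per_example=2):
    """Expand (text, label) pairs with paraphrases that keep the original label."""
    augmented = []
    for text, label in examples:
        augmented.append((text, label))
        for candidate in pick_paraphrases(paraphrase_fn(text), limit=per_example):
            if candidate != text:
                augmented.append((candidate, label))
    return augmented


def fake_generate(sentence):
    """Hypothetical stand-in for paraphraser.generate, returning a fixed result."""
    return {
        "best_candidats": ["Для чего нужна жизнь?"],
        "predictions": [
            "Для чего нужна жизнь?",
            "В чём цель существования человека?",
            "В чём счастье человека?)",
        ],
    }


# One labeled example (invented for illustration) expanded with two paraphrases.
rows = augment([("В чём смысл жизни?", "question")], fake_generate, per_example=2)
```

Putting the best candidates first means that a small `per_example` limit picks the filtered paraphrases before falling back to the unfiltered predictions.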


## Models

All models were fine-tuned on the same dataset (see below) and uploaded to Hugging Face.
Available models:
- [rugpt2-large](https://huggingface.co/sberbank-ai/rugpt2large), trained by the Sberbank team ([ru-gpts](https://github.com/sberbank-ai/ru-gpts))
- [mt5-small](https://huggingface.co/google/mt5-small)
- [mt5-base](https://huggingface.co/google/mt5-base)
- [mt5-large](https://huggingface.co/google/mt5-large)

To be continued... =)

## Dataset

All models were fine-tuned on a dataset that consists of two parts:

1) part of the [ParaPhraser data](http://paraphraser.ru/download/), about 200k filtered examples
2) filtered questions to chatbots and filtered subtitles from [here](https://github.com/rysshe/paraphrase/tree/master/data)

The dataset, together with an article describing all the details, will be available soon.

