Metadata-Version: 2.1
Name: japre
Version: 0.1.3
Summary: Custom pretokenizers for Japanese language models
Home-page: https://github.com/Alab-NII/japanese_pretokenizers
License: MIT
Author: Kaito Sugimoto
Author-email: kaito_sugimoto@is.s.u-tokyo.ac.jp
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: flake8 (>=4.0.1,<5.0.0)
Requires-Dist: fugashi (>=1.1.2,<2.0.0)
Requires-Dist: ipadic (>=1.0.0,<2.0.0)
Requires-Dist: pytextspan (>=0.5.4,<0.6.0)
Requires-Dist: tokenizers (>=0.12.1,<0.13.0)
Project-URL: Repository, https://github.com/Alab-NII/japanese_pretokenizers
Description-Content-Type: text/markdown

# japanese_pretokenizers (japre)

Custom pretokenizers for Japanese language models, designed to plug into Hugging Face `tokenizers`.

## Installation

```
pip install japre
```

## Usage

### IpadicPreTokenizer

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

from japre.ipadic import IpadicPreTokenizer

# Load an existing tokenizer definition and swap in the IPAdic-based pretokenizer.
tokenizer_object = Tokenizer.from_file("your-awesome-tokenizer.json")
tokenizer_object.pre_tokenizer = IpadicPreTokenizer.make()

# Wrap the tokenizer so it can be used with transformers models.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_object,
    unk_token="[UNK]",
    mask_token="[MASK]",
    cls_token="[CLS]",
    pad_token="[PAD]",
    sep_token="[SEP]",
)
```
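
To make the role of a pretokenizer more concrete, the sketch below shows the kind of output a pre-tokenization step produces: text pieces paired with their character offsets. It is a stdlib-only illustration and does not use japre, fugashi, or IPAdic. The names `script_of` and `split_by_script` and the script-based splitting rule are assumptions made up for this example; a real morphological analyzer will segment text differently.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Return a coarse script label for one character (illustrative only)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if ch.isspace():
        return "space"
    return "other"


def split_by_script(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Split text into runs of the same script, keeping (start, end) offsets.

    This is a hypothetical stand-in for dictionary-based segmentation,
    not the algorithm used by japre.
    """
    pieces = []
    pos = 0
    for label, group in groupby(text, key=script_of):
        chunk = "".join(group)
        start, end = pos, pos + len(chunk)
        pos = end
        if label != "space":  # whitespace separates pieces but is not kept
            pieces.append((chunk, (start, end)))
    return pieces
```

Each piece keeps its offsets into the original string, which is what lets a downstream subword model map tokens back to the input text.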

### ManbyoDictPreTokenizer

`ManbyoDictPreTokenizer` needs the Manbyo dictionary. Set `MANBYO_DICT_PATH` to the dictionary file first:

```
export MANBYO_DICT_PATH=/path/to/MANBYO_201907_Dic-utf8.dic
```

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

from japre.manbyo import ManbyoDictPreTokenizer

# Load an existing tokenizer definition and swap in the Manbyo-dictionary pretokenizer.
tokenizer_object = Tokenizer.from_file("your-awesome-tokenizer.json")
tokenizer_object.pre_tokenizer = ManbyoDictPreTokenizer.make()

# Wrap the tokenizer so it can be used with transformers models.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_object,
    unk_token="[UNK]",
    mask_token="[MASK]",
    cls_token="[CLS]",
    pad_token="[PAD]",
    sep_token="[SEP]",
)
```
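
Because the dictionary location comes from an environment variable, a missing or mistyped path can be easy to overlook. The following is a minimal sketch of a pre-flight check you could run before creating the pretokenizer. The helper name `resolve_dict_path` is hypothetical and not part of japre, and the sketch only assumes what the export line above shows: that `MANBYO_DICT_PATH` should point at the dictionary file.

```python
import os
from pathlib import Path


def resolve_dict_path(var_name: str = "MANBYO_DICT_PATH") -> Path:
    """Return the dictionary path from the environment, or raise a clear error.

    Hypothetical helper for illustration; japre itself may validate differently.
    """
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; export it before creating the pretokenizer"
        )
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{var_name} points to a missing file: {path}")
    return path
```

Calling `resolve_dict_path()` at the top of a training or inference script fails early with a readable message instead of surfacing a dictionary-loading error later on.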
