Metadata-Version: 2.3
Name: burmese-tokenizer
Version: 0.1.3
Summary: A simple tokenizer for Burmese text
Keywords: burmese,tokenizer,nlp,myanmar,text-processing
Author: janakhpon
Author-email: janakhpon <jnovaxer@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: twine>=6.1.0
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: isort>=5.12.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1.0 ; extra == 'dev'
Requires-Dist: sphinx>=7.0.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.3.0 ; extra == 'docs'
Requires-Python: >=3.11
Project-URL: Changelog, https://github.com/Code-Yay-Mal/burmese_tokenizer/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/Code-Yay-Mal/burmese_tokenizer#readme
Project-URL: Homepage, https://github.com/Code-Yay-Mal/burmese_tokenizer
Project-URL: Issues, https://github.com/Code-Yay-Mal/burmese_tokenizer/issues
Project-URL: Repository, https://github.com/Code-Yay-Mal/burmese_tokenizer
Provides-Extra: dev
Provides-Extra: docs
Description-Content-Type: text/markdown

# Burmese Tokenizer

Simple, fast Burmese text tokenization. No fancy stuff, just gets the job done.

## Install

```bash
pip install burmese-tokenizer
```

## Quick Start

```python
from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# tokenize
tokens = tokenizer.encode(text)
print(tokens["pieces"])
# ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# decode back
text = tokenizer.decode(tokens["pieces"])
print(text)
# မင်္ဂလာပါ။ နေကောင်းပါသလား။
```

## CLI

```bash
# tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# show details
burmese-tokenizer -v "မင်္ဂလာပါ။"

# decode tokens
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
```
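
When the pieces come from Python rather than being typed by hand, the decode command can be assembled programmatically. The sketch below only builds the argument list; it does not run the CLI. `decode_command` is a hypothetical helper, and it assumes, based on the example above, that `-t` takes the pieces joined by commas.

```python
import shlex


def decode_command(pieces: list[str]) -> list[str]:
    """Build the argv for `burmese-tokenizer -d -t ...` from a list of pieces."""
    for piece in pieces:
        if "," in piece:
            # A comma inside a piece could not be told apart from a separator.
            raise ValueError(f"piece contains a comma: {piece!r}")
    # Assumption: -t expects a single comma-separated string of pieces.
    return ["burmese-tokenizer", "-d", "-t", ",".join(pieces)]


if __name__ == "__main__":
    argv = decode_command(["▁မင်္ဂလာ", "▁ပါ", "။"])
    # Shell-quoted form, suitable for pasting into a terminal.
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with non-ASCII input.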

## API

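- `encode(text)` - tokenize text; the result's `"pieces"` entry holds the token strings
- `decode(pieces)` - convert a list of token pieces back to text
- `decode_ids(ids)` - convert token ids back to text
- `get_vocab_size()` - number of entries in the vocabulary
- `get_vocab()` - the full vocabulary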

## Links

- [PyPI](https://pypi.org/project/burmese-tokenizer/)
- [Contributing](https://github.com/Code-Yay-Mal/burmese_tokenizer/blob/main/docs/how_to_contribute.md)

## License

MIT - Do whatever you want with it.