Metadata-Version: 2.1
Name: poros
Version: 0.0.53
Summary: some useful code
Home-page: https://github.com/diqiuzhuanzhuan/poros
Author: Feynman
Author-email: diqiuzhuanzhuan@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: tensorflow (>=2.2.0)
Requires-Dist: tensorflow-addons
Requires-Dist: matplotlib
Requires-Dist: seqeval
Requires-Dist: tf2crf


This is a project for later lazy work!
Only support for python3, ☹️, but maybe you can try in python2

# Install
命令行直接安装
```bash
pip install poros
```
从代码库安装
```bash
git clone https://github.com/diqiuzhuanzhuan/poros.git
cd poros
python setup install
```


Some code is from other people, and some is from me.

# unilmv2
- create pretraining data
```python
from poros.unilmv2.dataman import PreTrainingDataMan
# vocab_file: make sure [SOS],[EOS] and [Pseudo] are in vocab_file
vocab_file = "vocab_file"# your vocab file
ptdm = PreTrainingDataMan(vocab_file=vocab_file, max_seq_length=128, max_predictions_per_seq=20, random_seed=2334)
input_file = "my_input_file" #file format is like bert
output_file = "my_output_file" #the output file is a tfrecord file
ptdm.create_pretraining_data(input_file, output_file)
dataset = ptdm.read_data_from_tfrecord(output_file, is_training=True, batch_size=8)
```
- create unilmv2 model and train it
```python
from poros.unilmv2.config import Unilmv2Config
from poros.unilmv2 import Unilmv2Model
from poros_train import optimization

"""
the configuration is like this:
{
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi", 
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768, 
  "initializer_range": 0.08,
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2, 
  "vocab_size": 21131
}
A json file recording these configuration is recommended.
"""
json_file = "my_config_file"
unilmv2_config = Unilmv2Config.from_json_file(json_file)
unilmv2_model = Unilmv2Model(config=unilmv2_config, is_training=True)
epoches=2000
steps_per_epoch=15
optimizer = optimization.create_optimizer(init_lr=6e-4, num_train_steps=epoches * steps_per_epoch, num_warmup_steps=1500)
unilmv2_model.compile(optimizer=optimizer)
unilmv2_model.fit(dataset, epochs=epoches, steps_per_epoch=15)
```

# bert
usage:
- create pretrain data

```python
from poros.bert import create_pretraining_data

>>> create_pretraining_data.main(input_file="./test_data/sample_text.txt",
output_file="./test_data/output", vocab_file="./test_data/vocab.txt")
```

- pretrain bert model
```python
from poros.bert_model import pretrain
>>> pretrain.run(input_file="./test_data/output",  bert_config_file="./test_data/bert_config.json", 
output_dir="./output")

```
- prepare a trained model, tell classifier model
- prepare train.csv and test.csv, its format is like this: "id, text1, label", but remember no header!
- init the model, the code is like below
````python
from poros.bert_model.run_classifier import SimpleClassifierModel
>>> model = SimpleClassifierModel(
    bert_config_file="./data/chinese_L-12_H-768_A-12/bert_config.json",      
     vocab_file="./data/chinese_L-12_H-768_A-12/vocab.txt",                   
     output_dir="./output",                                                   
     max_seq_length=512,                                                      
     train_file="./data/train.csv",                                           
     dev_file="./data/dev.csv",                                               
     init_checkpoint="./data/chinese_L-12_H-768_A-12/bert_model.ckpt",        
     label_list=["0", "1", "2", "3"]                                                  
    )
````

# poros_dataset
some operations about tensor
```python
from poros.poros_dataset import about_tensor
import tensorflow as tf
>>> A = tf.constant(value=[0])
>>> print(about_tensor.get_shape(A))
[1]

```

# poros_chars
Provide a list of small functions

usage:
- convert chinese words into arabic number:
```python
from poros.poros_chars import chinese_to_arabic
>>> print(chinese_to_arabic.NumberAdapter.convert("四千三百万"))
43000000

```

# Thanks
PyCharm

