Metadata-Version: 2.1
Name: SEFR-CUT
Version: 1.1
Summary: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP2020)
Home-page: https://github.com/mrpeerat/SEFR_CUT
Author: mrpeerat
Author-email: peerat.l_s19@vistec.ac.th
License: MIT
Keywords: thai word segmentation,word segmentation,thainlp
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Natural Language :: Thai
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
Requires-Dist: tensorflow (>=2.0.0)
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: python-crfsuite
Requires-Dist: pyahocorasick

# SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation) 
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020) <br>
A CRF serves as the stacked model, with DeepCut as the baseline model.<br>

## Read more:
- Paper: [Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble]()
- Blog: [Domain Adaptation for Word Segmenters: It Really Works (in Thai)](https://medium.com/@pingloaf)

## Install
```
pip install sefr_cut
```

## How to use
### Requirements
- python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0

## Example
- Example files are on [SEFR Example notebook](https://github.com/mrpeerat/SEFR_CUT/blob/master/Notebooks/1.SEFR_CUT%20example.ipynb)
- [Try it on Colab](https://colab.research.google.com/drive/1xA2rzYVnVWwxy6oFkISiG63x-5u1gwa1?usp=sharing)
### Load Engine & Engine Mode
- ws1000, tnhc, and best
  - ws1000: model trained on Wisesight-1000 and tested on Wisesight-160
  - tnhc: model trained on TNHC (80:20 train/test split with random seed 42)
  - best: model trained on the BEST-2010 corpus (NECTEC)
  ```
  sefr_cut.load_model(engine='ws1000')
  # OR
  sefr_cut.load_model(engine='tnhc')
  # OR
  sefr_cut.load_model(engine='best')
  ```
- tl-deepcut-XXXX
  - We also provide DeepCut models fine-tuned via transfer learning: tl-deepcut-ws1000 (on Wisesight-1000) and tl-deepcut-tnhc (on TNHC)
  ```
  sefr_cut.load_model(engine='tl-deepcut-ws1000')
  # OR
  sefr_cut.load_model(engine='tl-deepcut-tnhc')
  ```
- deepcut
  - We also provide the original DeepCut model
  ```
  sefr_cut.load_model(engine='deepcut')
  ```
### Segment Example
- Segment with default k
  ```
  sefr_cut.load_model(engine='ws1000')
  print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
  print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
  print(sefr_cut.tokenize('สวัสดีประเทศไทย'))

  [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
  [['สวัสดี', 'ประเทศ', 'ไทย']]
  [['สวัสดี', 'ประเทศ', 'ไทย']]
  ```
- Segment with different k
  ```
  sefr_cut.load_model(engine='ws1000')
  print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
  print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters

  [['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
  [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
  ```
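The tokenizer returns lists of tokens, while ```evaluation()``` takes '|'-delimited strings, so outputs can be joined back together for scoring. A minimal sketch (plain Python, no library calls):

```
# Join token lists back into the '|'-delimited format used by evaluation().
tokens = [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
segmented = ['|'.join(sent) for sent in tokens]
print(segmented)  # ['สวัสดี|ประเทศ|ไทย', 'ลุง|ตู่|สู้|ๆ']
```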

## Evaluation
- Character- and word-level evaluation is provided by the ```evaluation()``` function
  - For example
  ```
  answer = 'สวัสดี|ประเทศไทย'
  pred = 'สวัสดี|ประเทศ|ไทย'
  char_score,word_score = sefr_cut.evaluation(answer,pred)
  print(f'Word Score: {word_score} Char Score: {char_score}')

  Word Score: 0.4 Char Score: 0.8

  answer = ['สวัสดี|ประเทศไทย']
  pred = ['สวัสดี|ประเทศ|ไทย']
  char_score,word_score = sefr_cut.evaluation(answer,pred)
  print(f'Word Score: {word_score} Char Score: {char_score}')

  Word Score: 0.4 Char Score: 0.8


  answer = [['สวัสดี|'],['ประเทศไทย']]
  pred = [['สวัสดี|'],['ประเทศ|ไทย']]
  char_score,word_score = sefr_cut.evaluation(answer,pred)
  print(f'Word Score: {word_score} Char Score: {char_score}')

  Word Score: 0.4 Char Score: 0.8
  ```
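For intuition, scores of this kind can be reproduced with a small standalone sketch: word-level F1 over exactly-matching word spans, and character-level F1 over word-start positions. This is our illustrative reading of the metric, not the library's actual implementation; it happens to reproduce the 0.4 / 0.8 numbers above.

```
def f1(tp, n_pred, n_true):
    """F1 from true positives and the predicted/reference counts."""
    precision, recall = tp / n_pred, tp / n_true
    return 2 * precision * recall / (precision + recall)

def spans(segmented):
    """Turn 'สวัสดี|ประเทศไทย' into a set of (start, end) word spans."""
    out, pos = set(), 0
    for w in segmented.split('|'):
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

answer, pred = 'สวัสดี|ประเทศไทย', 'สวัสดี|ประเทศ|ไทย'
a, p = spans(answer), spans(pred)

# Word-level: a word counts as correct only if both boundaries match.
word_f1 = f1(len(a & p), len(p), len(a))

# Character-level: compare the sets of word-start positions.
starts_a, starts_p = {s for s, _ in a}, {s for s, _ in p}
char_f1 = f1(len(starts_a & starts_p), len(starts_p), len(starts_a))

print(round(word_f1, 4), round(char_f1, 4))  # 0.4 0.8
```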

## Performance
<img src="https://user-images.githubusercontent.com/21156980/94525454-4d2e6680-025e-11eb-929f-7bcbb76e92fd.PNG" width="600" height="386" />
<img src="https://user-images.githubusercontent.com/21156980/94525459-4e5f9380-025e-11eb-9ce6-fd1598b902eb.PNG" width="600" height="386" />
<img src="https://user-images.githubusercontent.com/21156980/94525741-b9a96580-025e-11eb-81f1-1016e59e25cf.PNG" width="600" height="306" />

## How to re-train?
- You can re-train the models using the notebooks in the [Notebooks](https://github.com/mrpeerat/SEFR_CUT/tree/master/Notebooks) folder; we provide everything you need!
  ### Re-train Model
  - Run notebook #2. The corpus in 'Notebooks/corpus/' is Wisesight-1000; you can also try BEST, TNHC, and LST20!
  - Rename the ```CRF_model_name``` variable
  - Link:[HERE](https://github.com/mrpeerat/SEFR_CUT/blob/master/Notebooks/2.Train_DS_model.ipynb)
  ### Filter and Refine Example
  - Set ```CRF_model_name``` to the same name used in notebook #2
  - To see why we use filter-and-refine, try uncommenting these three lines in the ```score_()``` function:
  ```
  #answer = scoring_function(y_true,cp.deepcopy(y_pred),entropy_index_og)
  #f1_hypothesis.append(eval_function(y_true,answer))
  #ax.plot(range(start,K_num,step),f1_hypothesis,c="r",marker='o',label='Best case')
  ```
  - Link:[HERE](https://github.com/mrpeerat/SEFR_CUT/blob/master/Notebooks/3.Stacked%20Model%20Example.ipynb)
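  To see what the filter step looks like, here is a minimal standalone sketch (made-up probabilities and a hypothetical helper, not the notebook's actual code): rank characters by the entropy of the baseline's boundary prediction and keep only the top k% for the stacked model to refine.

  ```
  import math

  def topk_uncertain(probs, k):
      """Return indices of the k% most uncertain predictions,
      ranked by binary entropy of the boundary probability."""
      def entropy(p):
          if p in (0.0, 1.0):
              return 0.0
          return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
      n = max(1, round(len(probs) * k / 100))
      ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
      return sorted(ranked[:n])

  # Hypothetical boundary probabilities for 10 characters;
  # values near 0.5 are the most uncertain, so they get refined.
  probs = [0.99, 0.55, 0.97, 0.48, 0.90, 0.05, 0.60, 0.99, 0.51, 0.95]
  print(topk_uncertain(probs, 30))  # [1, 3, 8]
  ```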
  ### Use your own model?
  - Move your model from 'Notebooks/model/' to 'sefr_cut/model/' and load it in one line.
  ```
  sefr_cut.load_model(engine='my_model')
  ```

## Citation
- Citation information will be added once our paper appears in the ACL Anthology

Thanks to code from:

- [Deepcut](https://github.com/rkcosmos/deepcut) (baseline model): we used some of DeepCut's code to perform transfer learning
- [@bact](https://github.com/bact) (CRF training code): we adapted code from https://github.com/bact/nlp-thai to train the CRF model




