Metadata-Version: 2.1
Name: pycode2seq
Version: 0.0.6
Summary: Inference and training for multiple languages of code2seq
Home-page: https://github.com/kisate/pycode2seq
Author: Dmitrii Kharlapenko
Author-email: dimkakha@gmail.com
License: MIT
Download-URL: https://pypi.org/project/pycode2seq/
Keywords: code2seq,pytorch,pytorch-lightning,ml4code,ml4se
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: torch (>=1.9.0)
Requires-Dist: torchtext (>=0.10.0)
Requires-Dist: pytorch-lightning (>=1.3.5)
Requires-Dist: code2seq (==0.0.2)
Requires-Dist: antlr4-python3-runtime (==4.8)
Requires-Dist: setuptools (>=52.0.0)
Requires-Dist: tqdm (==4.58.0)
Requires-Dist: numpy (>=1.20.1)
Requires-Dist: regex (>=2019.11.1)
Requires-Dist: omegaconf (~=2.0.6)
Requires-Dist: dataclasses (~=0.6)
Requires-Dist: requests (~=2.25.1)

# pycode2seq

Pure Python library for `code2seq` embeddings.

Supports extending existing pretrained code2seq embeddings to multilingual models.
We provide an example that extends the Java model with Kotlin.
The pretrained model and a usage example are provided below.

## Installation

```shell
pip install pycode2seq
```

## Inference

#### File embeddings example

```python
from pycode2seq import Code2Seq

model = Code2Seq.load("kt_java")
method_embeddings = model.methods_embeddings("File.kt")
```

The pretrained common Java and Kotlin model is downloaded automatically.

#### Full functionality
```python
import sys
from pycode2seq import Code2Seq

def main(argv):
    model = Code2Seq.load("kt_java")

    # Dictionary of method names with their embeddings
    method_embeddings = model.methods_embeddings("File.kt", "kt")

    # code2seq predictions
    predictions = model.run_on_file(argv[1], "kt")

    # Predicted method names
    names = [model.prediction_to_text(prediction) for prediction in predictions]

if __name__ == "__main__":
    main(sys.argv)
```
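To embed several files in one pass, the per-file call can be wrapped in a small helper. The sketch below is not part of pycode2seq: `collect_embeddings` and its `embed_file` parameter are hypothetical names, and it assumes only what the comment above states, namely that the per-file call returns a dictionary mapping method names to embeddings. The model call is passed in as a function, so the helper itself depends on no particular library.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def collect_embeddings(
    paths: Iterable[str],
    embed_file: Callable[[str], Dict[str, Any]],
) -> Dict[Tuple[str, str], Any]:
    """Gather per-method embeddings from many files into one dictionary.

    Keys are (file path, method name) pairs, so methods with the same
    name in different files do not overwrite each other.
    """
    collected: Dict[Tuple[str, str], Any] = {}
    for path in paths:
        # embed_file is assumed to return {method name: embedding}.
        for method_name, embedding in embed_file(path).items():
            collected[(path, method_name)] = embedding
    return collected
```

With a loaded model, `embed_file` could be `lambda path: model.methods_embeddings(path, "kt")`.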

### Available models

- Java (`java`)
- Kotlin (`kt` or `kotlin`)
- Java & Kotlin (`kt_java`)

The `kt_java` model is compatible with the `java` model and should produce the same embeddings.
The `kotlin` model is part of the `kt_java` model, so they are compatible too.

You can therefore use the common `kt_java` model to get **embeddings in one vector space for both languages**.
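
Because both languages share one vector space, the embedding of a Kotlin method can be compared directly with that of a Java method, for example with cosine similarity. The following is a minimal sketch, not part of pycode2seq: `cosine_similarity` and `most_similar` are hypothetical helpers, and embeddings are assumed to have been converted to plain sequences of floats (the type actually returned by the library may differ, e.g. a tensor).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank candidate methods by similarity to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Here `query` could be the embedding of one Kotlin method and `candidates` a dictionary of Java method embeddings, both produced by the `kt_java` model.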

## Training

Download astminer and run:

```shell
./gradlew shadowJar
```

Mine projects for paths:

```shell
python training/mine_projects.py <data folder> <output folder> <path to astminer's cli.sh>
```

Combine mined paths:

```shell
python training/astminer_to_code2seq.py <data folder/holdout> <output folder> <holdout>
```

Build the vocabulary with `build_vocabulary.py` from the `code2seq` module.

Combine vocabularies:

```shell
python training/combine_vocabularies.py
```

Expand weights:

```shell
python training/expand_weights.py
```
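
The last two steps are what would make an extended model compatible with the original one. As a conceptual illustration only (this is not the code of `combine_vocabularies.py` or `expand_weights.py`, whose arguments and internals are not described here), the sketch below assumes that a vocabulary is a token-to-index mapping with contiguous indices and that an embedding table is a list of rows, one per index. Existing tokens keep their indices and rows, and new tokens are appended with freshly initialized rows. The function names, data layout, and initialization range are all assumptions.

```python
import random
from typing import Dict, List


def combine_vocabularies(base: Dict[str, int], extra: Dict[str, int]) -> Dict[str, int]:
    """Append tokens from `extra` that are missing in `base`.

    Indices of `base` are left untouched so pretrained rows stay valid.
    """
    combined = dict(base)
    next_index = max(combined.values(), default=-1) + 1
    # Iterate in the order of extra's indices for a deterministic result.
    for token in sorted(extra, key=extra.get):
        if token not in combined:
            combined[token] = next_index
            next_index += 1
    return combined


def expand_weights(weights: List[List[float]], new_size: int, seed: int = 0) -> List[List[float]]:
    """Copy pretrained rows and add small random rows up to `new_size`."""
    if new_size < len(weights):
        raise ValueError("new_size must not be smaller than the current table")
    rng = random.Random(seed)
    dim = len(weights[0]) if weights else 0
    expanded = [list(row) for row in weights]
    for _ in range(new_size - len(weights)):
        # Assumed initialization: small uniform noise for new tokens.
        expanded.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return expanded
```

Under these assumptions, the rows of the original vocabulary are unchanged after expansion, which is the property an extended model needs in order to keep the original model's embeddings.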

## Using speedy-antlr-tool

You can use [speedy-antlr](https://speedy-antlr-tool.readthedocs.io/en/latest/example.html) to speed up file parsing.

Clone and install the [modified example](https://github.com/amykyta3/speedy-antlr-example).

Replace the parser call with:

```python
stream = antlr4.FileStream(input_file)
tree = sa_kotlin.parse(stream, "kotlinFile", sa_kotlin.SA_ErrorListener())
```

You still need the lexer to recover token values, though.

Note that to build a Java parser you will need to follow the speedy-antlr tutorial and make another package.

## Using astminer to parse files

Clone the [astminer fork with Kotlin support](https://github.com/kisate/astminer) and run:

```shell
./gradlew shadowJar
```

Extract methods with `cli.sh`; its arguments and usage can be found in `training/mine_projects.py`.

Pass the path to the folder with the CSV files to `run_model_on_astminer_csv()`.


