Metadata-Version: 2.1
Name: esco-skill-extractor
Version: 0.1.10
Summary: Extract ESCO skills from texts such as job descriptions or CVs
Home-page: https://github.com/KonstantinosPetrakis/esco-skill-extractor
Author: Konstantinos Petrakis
Author-email: konstpetrakis01@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Flask==3.0.3
Requires-Dist: Jinja2==3.1.4
Requires-Dist: MarkupSafe==3.0.1
Requires-Dist: PyYAML==6.0.2
Requires-Dist: Werkzeug==3.0.4
Requires-Dist: blinker==1.8.2
Requires-Dist: certifi==2024.8.30
Requires-Dist: charset-normalizer==3.4.0
Requires-Dist: click==8.1.7
Requires-Dist: filelock==3.16.1
Requires-Dist: fsspec==2024.9.0
Requires-Dist: huggingface-hub==0.25.2
Requires-Dist: idna==3.10
Requires-Dist: itsdangerous==2.2.0
Requires-Dist: joblib==1.4.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: networkx==3.4.1
Requires-Dist: numpy==2.1.2
Requires-Dist: packaging==24.1
Requires-Dist: pandas==2.2.3
Requires-Dist: pillow==10.4.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pytz==2024.2
Requires-Dist: regex==2024.9.11
Requires-Dist: requests==2.32.3
Requires-Dist: safetensors==0.4.5
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: scipy==1.14.1
Requires-Dist: sentence-transformers==3.2.0
Requires-Dist: six==1.16.0
Requires-Dist: summa==1.2.0
Requires-Dist: sympy==1.13.3
Requires-Dist: threadpoolctl==3.5.0
Requires-Dist: tokenizers==0.20.1
Requires-Dist: torch==2.4.1
Requires-Dist: tqdm==4.66.5
Requires-Dist: transformers==4.45.2
Requires-Dist: typing-extensions==4.12.2
Requires-Dist: tzdata==2024.2
Requires-Dist: urllib3==2.2.3
Requires-Dist: waitress==3.0.0
Provides-Extra: cuda
Requires-Dist: nvidia-cublas-cu12==12.1.3.1; extra == "cuda"
Requires-Dist: nvidia-cuda-cupti-cu12==12.1.105; extra == "cuda"
Requires-Dist: nvidia-cuda-nvrtc-cu12==12.1.105; extra == "cuda"
Requires-Dist: nvidia-cuda-runtime-cu12==12.1.105; extra == "cuda"
Requires-Dist: nvidia-cudnn-cu12==9.1.0.70; extra == "cuda"
Requires-Dist: nvidia-cufft-cu12==11.0.2.54; extra == "cuda"
Requires-Dist: nvidia-curand-cu12==10.3.2.106; extra == "cuda"
Requires-Dist: nvidia-cusolver-cu12==11.4.5.107; extra == "cuda"
Requires-Dist: nvidia-cusparse-cu12==12.1.0.106; extra == "cuda"
Requires-Dist: nvidia-nccl-cu12==2.20.5; extra == "cuda"
Requires-Dist: nvidia-nvjitlink-cu12==12.6.77; extra == "cuda"
Requires-Dist: nvidia-nvtx-cu12==12.1.105; extra == "cuda"
Requires-Dist: triton==3.0.0; extra == "cuda"

# ESCO Skill Extractor

This tool extracts **ESCO skills from texts** such as job descriptions or CVs. It embeds the text and the ESCO skills with a transformer model and compares the embeddings using cosine similarity.

## Installation

```bash
pip install esco-skill-extractor
```

Or, for NVIDIA GPU acceleration:

```bash
pip install esco-skill-extractor[cuda]
```

## Usage

### Via python

```python
from esco_skill_extractor import SkillExtractor

# Don't be scared: the first run takes longer because it downloads the model and creates the embeddings.
skill_extractor = SkillExtractor()

ads = [
    "We are looking for a software engineer with experience in Java and Python.",
    "We are looking for a devops engineer. Containerization tools such as Docker is a must. AWS is a plus."
    # ...
]

print(skill_extractor.get_skills(ads))

# Output:
# [
#     [
#         "http://data.europa.eu/esco/skill/ccd0a1d9-afda-43d9-b901-96344886e14d"
#     ],
#     [
#         "http://data.europa.eu/esco/skill/f0de4973-0a70-4644-8fd4-3a97080476f4",
#         "http://data.europa.eu/esco/skill/ae4f0cc6-e0b9-47f5-bdca-2fc2e6316dce",
#     ],
# ]
```

### Via GUI

```bash
# Visit the URL printed in the console.
# Run `python -m esco_skill_extractor --help` for more options.
python -m esco_skill_extractor
```

<img src="docs/gui.gif">

### Via API

```bash
# The server URL is printed in the console.
# Run `python -m esco_skill_extractor --help` for more options.
python -m esco_skill_extractor
```

```js
async function getSkills() {
    const texts = [
        "We are looking for a software engineer with experience in Java and Python.",
        "We are looking for a devops engineer. Containerization tools such as Docker is a must. AWS is a plus."
        // ...
    ];

    // Default host is localhost, and default port is 8000. Check CLI options for more.
    const response = await fetch("http://localhost:8000/extract", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
        },
        body: JSON.stringify(texts),
    });

    const skills = await response.json();
    console.log(skills);
    // Output:
    // [
    //     [
    //         "http://data.europa.eu/esco/skill/ccd0a1d9-afda-43d9-b901-96344886e14d"
    //     ],
    //     [
    //         "http://data.europa.eu/esco/skill/f0de4973-0a70-4644-8fd4-3a97080476f4",
    //         "http://data.europa.eu/esco/skill/ae4f0cc6-e0b9-47f5-bdca-2fc2e6316dce",
    //     ],
    // ]
}
```
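
The same request can be sketched in Python using only the standard library. This is a hedged illustration rather than part of the package: the helper names `build_extract_request` and `extract_skills` are hypothetical, and the sketch assumes the server is running on the default host and port with the `/extract` endpoint shown above.

```python
import json
import urllib.request


def build_extract_request(texts, base_url="http://localhost:8000"):
    """Build a POST request carrying the texts as a JSON array."""
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(texts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_skills(texts, base_url="http://localhost:8000"):
    """Send the texts to a running server and return the parsed JSON reply."""
    request = build_extract_request(texts, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server; only extract_skills() does.
request = build_extract_request(["We are looking for a devops engineer."])
```

With the server started as above, calling `extract_skills(texts)` should return the same list-of-lists structure as the JavaScript example, assuming the endpoint behaves as described there.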

## Possible keyword arguments for `SkillExtractor`

| Keyword Argument | Description                                                                                                          | Default |
| ---------------- | -------------------------------------------------------------------------------------------------------------------- | ------- |
| threshold        | Skills surpassing this cosine similarity threshold are considered a match.                                           | 0.4     |
| device           | The device on which the computations run, e.g. a torch device.                                                       | "cpu"   |
| max_words        | If any sentence in the input considerably exceeds this number of words, it is summarized to roughly that many words. | -1      |

## How it works

1. It creates embeddings from the ESCO skills found on the official ESCO website.
2. It creates embeddings from the input text (one for each sentence).
   1. If any sentence surpasses the `max_words` limit, it is summarized to that number of words by using an [implementation of the TextRank algorithm](https://github.com/summanlp/textrank).
3. It compares the embeddings of the text with the embeddings of the ESCO skills using cosine similarity.
4. It returns the most similar ESCO skill per sentence if its similarity passes the `threshold`.
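
Steps 3 and 4 can be illustrated with toy vectors. This is a minimal sketch, not the package's actual code: the function names `cosine_similarity` and `best_match`, the three-dimensional vectors, and the placeholder skill identifiers are all hypothetical. Only the cosine-similarity comparison and the default threshold of 0.4 come from the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(sentence_embedding, skill_embeddings, threshold=0.4):
    """Return the most similar skill key, or None if it misses the threshold."""
    best_key, best_score = None, -1.0
    for key, embedding in skill_embeddings.items():
        score = cosine_similarity(sentence_embedding, embedding)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score > threshold else None


# Toy 3-dimensional "embeddings" with made-up keys; a real model
# produces much larger vectors and the package returns ESCO URIs.
skills = {
    "skill/java": [1.0, 0.0, 0.0],
    "skill/docker": [0.0, 1.0, 0.0],
}

match = best_match([0.9, 0.1, 0.0], skills)     # close to "skill/java"
no_match = best_match([0.0, 0.0, 1.0], skills)  # orthogonal to both skills
```

Here `match` is `"skill/java"` and `no_match` is `None`, because a similarity of 0 does not surpass the threshold. Raising `threshold` trades recall for precision in the same way.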


