Metadata-Version: 2.1
Name: protein-information-system
Version: 1.5.1
Summary: Comprehensive Python Module for Protein Data Management: Designed for streamlined integration and processing of protein information from both UniProt and PDB. Equipped with features for concurrent data fetching, robust error handling, and database synchronization.
Author: frapercan
Author-email: frapercan1@alum.us.es
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: bio (>=1.8.0,<2.0.0)
Requires-Dist: gemmi (>=0.7.3,<0.8.0)
Requires-Dist: h5py (>=3.12.1,<4.0.0)
Requires-Dist: mini3di (>=0.2.1,<0.3.0)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pgvector (>=0.4,<0.5)
Requires-Dist: pika (>=1.3.2,<2.0.0)
Requires-Dist: psycopg2-binary (>=2.9.9,<3.0.0)
Requires-Dist: pyyaml (>=6.0.1,<7.0.0)
Requires-Dist: retry (>=0.9.2,<0.10.0)
Requires-Dist: sentencepiece (>=0.2.0,<0.3.0)
Requires-Dist: sqlalchemy (>=2.0.40,<3.0.0)
Requires-Dist: tokenizer (>=3.4.3,<4.0.0)
Requires-Dist: torch (>=2.3.0,<3.0.0)
Requires-Dist: transformers (>=4.48.1,<5.0.0)
Description-Content-Type: text/markdown

[![PyPI - Version](https://img.shields.io/pypi/v/protein-information-system)](https://pypi.org/project/protein-information-system/)
[![Documentation Status](https://readthedocs.org/projects/protein-information-system/badge/?version=latest)](https://protein-information-system.readthedocs.io/en/latest/?badge=latest)
![Linting Status](https://github.com/CBBIO/protein-information-system/actions/workflows/test-lint.yml/badge.svg?branch=main)
[![codecov](https://codecov.io/gh/CBBIO/protein-information-system/branch/main/graph/badge.svg)](https://codecov.io/gh/CBBIO/protein-information-system)

# **Protein Information System (PIS)**

**Protein Information System (PIS)** is an integrated biological information system focused on extracting, processing, and managing protein-related data. PIS consolidates data from **UniProt**, **PDB**, and **GOA**, enabling the efficient retrieval and organization of protein sequences, structures, and functional annotations.

The primary goal of PIS is to provide a robust framework for large-scale protein data extraction, facilitating downstream functional analysis and annotation transfer. The system is designed for **high-performance computing (HPC) environments**, ensuring scalability and efficiency.


## 📈 **Current State of the Project**

### **FANTASIA: Functional Annotation Toolkit**


> 🧠 **FANTASIA** was built on top of the Protein Information System (PIS) as an advanced tool for **functional protein annotation** using embeddings generated by protein language models.
>
> [🔗 FANTASIA Repository](https://github.com/CBBIO/FANTASIA)
>
> The pipeline supports high-performance computing (HPC) environments and integrates tools such as ProtT5, ESM, and CD-HIT. The protein language models can be extended or replaced with new variants **without modifying the core software structure**, simply by adding the new model to PIS. This design enables scalable, modular, and reproducible GO term annotation from FASTA sequence files.


### **Protocol for Large-Scale Metamorphism and Multifunctionality Search**

> 🔍 In addition, a systematic protocol has been developed for the **large-scale identification of structural metamorphisms** and **protein multifunctionality**.
>
> [🔗 Metamorphic and Multifunctionality Search Repository](https://github.com/CBBIO/metamorphic_multifunctional_search)
> 
> This protocol leverages the full capabilities of PIS to uncover non-obvious relationships between structure and function. **Structural metamorphisms** are detected by filtering large-scale structural alignments between proteins with high sequence identity, identifying divergent conformations. **Multifunctionality** is addressed through a semantic analysis of GO annotations, computing a functional distance metric to determine the two most divergent terms within each GO category per protein.
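
The multifunctionality step described above (finding the two most divergent GO terms per category) can be illustrated with a minimal sketch. This is not the protocol's actual implementation: the function name `most_divergent_pairs`, the placeholder term identifiers, and the toy distance table are all assumptions made for illustration, and any semantic distance between GO terms could be plugged in instead.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Set, Tuple


def most_divergent_pairs(
    annotations: Iterable[Tuple[str, str]],
    distance: Callable[[str, str], float],
) -> Dict[str, Tuple[str, str, float]]:
    """For one protein, return the most distant pair of terms per GO category.

    `annotations` is an iterable of (category, term) pairs.
    `distance` is any symmetric functional distance between two terms.
    Categories with fewer than two distinct terms are omitted.
    """
    # Group the protein's terms by GO category.
    by_category: Dict[str, Set[str]] = {}
    for category, term in annotations:
        by_category.setdefault(category, set()).add(term)

    result: Dict[str, Tuple[str, str, float]] = {}
    for category, terms in by_category.items():
        best = None
        # Exhaustive pairwise comparison; sorted() keeps the output deterministic.
        for a, b in combinations(sorted(terms), 2):
            d = distance(a, b)
            if best is None or d > best[2]:
                best = (a, b, d)
        if best is not None:
            result[category] = best
    return result


# Toy distance table with placeholder term names (hypothetical values).
TOY_DISTANCES = {
    frozenset({"GO:A", "GO:B"}): 0.2,
    frozenset({"GO:A", "GO:C"}): 0.9,
    frozenset({"GO:B", "GO:C"}): 0.5,
}


def toy_distance(a: str, b: str) -> float:
    """Look up the distance for an unordered pair of terms."""
    return TOY_DISTANCES[frozenset({a, b})]


example = most_divergent_pairs(
    [("F", "GO:A"), ("F", "GO:B"), ("F", "GO:C"), ("P", "GO:X")],
    toy_distance,
)
```

The pairwise search is quadratic in the number of terms per category, which is usually acceptable because a single protein rarely carries more than a few dozen terms in one category.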

---

## **Prerequisites**

- Python 3.10 or newer (for example, 3.11.6)
- RabbitMQ
- PostgreSQL with the pgvector extension installed

---

## **Setup Instructions**

### 1. Install Docker
Make sure Docker is installed on your system. If it is not, download it from the [official Docker documentation](https://docs.docker.com/get-docker/).

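### 2. Start the Required Services

PIS needs both PostgreSQL (with pgvector) and RabbitMQ running. The following steps start each one as a Docker container.

### 3. Set Up PostgreSQL with pgvector

Start a PostgreSQL container that ships the pgvector extension using the command below: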

```bash
docker run -d --name pgvectorsql \
    --shm-size=1g \
    -e POSTGRES_USER=usuario \
    -e POSTGRES_PASSWORD=clave \
    -e POSTGRES_DB=BioData \
    -p 5432:5432 \
    pgvector/pgvector:pg16 
```
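> ⚠️ Keep `--shm-size` at `1g` or higher to avoid performance issues. Docker's default shared-memory size (64 MB) can be too small for PostgreSQL under heavy queries.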
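
As a minimal sketch of how a client could reach this container, the snippet below builds a SQLAlchemy-style connection URL from the values passed to `docker run` above. The helper names `postgres_url` and `enable_pgvector` are hypothetical and not part of PIS. Whether PIS enables the `vector` extension automatically is not stated here, so the second helper is only an assumption about what a manual setup might look like.

```python
from urllib.parse import quote_plus


def postgres_url(
    user: str,
    password: str,
    host: str = "localhost",
    port: int = 5432,
    database: str = "BioData",
) -> str:
    """Build a SQLAlchemy URL for the psycopg2 driver.

    User and password are percent-encoded so that characters such as
    '@' or '/' do not break the URL.
    """
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def enable_pgvector(url: str) -> None:
    """Enable the pgvector extension in the target database (needs a live server).

    SQLAlchemy is imported lazily so the URL helper works without it.
    """
    from sqlalchemy import create_engine, text

    engine = create_engine(url)
    # engine.begin() opens a transaction and commits on successful exit.
    with engine.begin() as conn:
        conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))


# Credentials match the POSTGRES_* variables in the docker command above.
url = postgres_url("usuario", "clave")
```

Calling `enable_pgvector(url)` requires the container to be running and `sqlalchemy` plus `psycopg2-binary` to be installed, both of which are listed as dependencies of this package.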



### 4. (Optional) Connect to the Database

You can use **pgAdmin 4**, a graphical interface for managing and interacting with PostgreSQL databases, or any other SQL client.

### 5. Set Up RabbitMQ

Start a RabbitMQ container using the command below:

```bash
docker run -d --name rabbitmq \
    -p 15672:15672 \
    -p 5672:5672 \
    rabbitmq:management
```

### 6. (Optional) Manage RabbitMQ

Once RabbitMQ is running, you can access its management interface at [RabbitMQ Management Interface](http://localhost:15672/#/queues).
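
Before launching the pipeline, it can help to confirm that the ports published by the containers above are reachable. The following is a small sketch using only the standard library. The helper names and the `SERVICES` mapping are illustrative assumptions, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket
from typing import Dict


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(services: Dict[str, int], host: str = "localhost") -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}


# Ports taken from the `docker run -p` flags in this README.
SERVICES = {
    "PostgreSQL": 5432,
    "RabbitMQ (AMQP)": 5672,
    "RabbitMQ (management UI)": 15672,
}

if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```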

---

## **Get started:**

To execute the full extraction process, simply run:

```bash
python main.py
```

This command will trigger the complete workflow, starting from the initial data preprocessing stages and continuing through to the final data organization and storage.

## **Customizing the Workflow:**

You can customize the sequence of tasks executed by modifying `main.py` or adjusting the relevant parameters in the `config.yaml` file. This allows you to tailor the extraction process to meet specific research needs or to experiment with different data processing configurations.
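
The idea of a configurable task sequence can be sketched as follows. This is not the actual content of `main.py`: the task names, the `tasks` and `max_workers` keys, and the `run_workflow` helper are all hypothetical, and the real options should be taken from the project's own `config.yaml`.

```python
from typing import Any, Callable, Dict, List

Task = Callable[[Dict[str, Any]], str]


def run_workflow(config: Dict[str, Any], registry: Dict[str, Task]) -> List[str]:
    """Run the tasks named in config["tasks"], in order, and collect their results."""
    results = []
    for name in config["tasks"]:
        if name not in registry:
            raise KeyError(f"Unknown task in configuration: {name}")
        results.append(registry[name](config))
    return results


# Hypothetical tasks; each receives the whole configuration dictionary.
def extract_sequences(config: Dict[str, Any]) -> str:
    return f"sequences extracted with {config['max_workers']} workers"


def extract_structures(config: Dict[str, Any]) -> str:
    return f"structures extracted with {config['max_workers']} workers"


REGISTRY: Dict[str, Task] = {
    "extract_sequences": extract_sequences,
    "extract_structures": extract_structures,
}

# In a real setup this dictionary would typically come from a YAML file,
# e.g. yaml.safe_load(open("config.yaml")); it is inlined here so the
# sketch runs without extra dependencies.
config = {"tasks": ["extract_structures", "extract_sequences"], "max_workers": 4}

results = run_workflow(config, REGISTRY)
```

Reordering or removing entries in the `tasks` list changes what runs without touching the task code, which is the kind of customization described above.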


