Metadata-Version: 2.1
Name: protein-explorer
Version: 0.1.0
Summary: A package for exploring protein structures and phosphorylation sites
Home-page: https://github.com/class-account/protein-explorer
Author: David Vanderwall
Author-email: dvanderwall@hms.harvard.edu
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Provides-Extra: database
Provides-Extra: dev

# KinoPlex: A Comprehensive Structural Atlas of the Human Phosphoproteome

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Version](https://img.shields.io/badge/Version-1.0.0-green.svg)](VERSION)

## Unprecedented Coverage of the Human Phosphoproteome

KinoPlex represents a landmark achievement in phosphoproteomics, providing the most comprehensive structural characterization of phosphorylation sites to date:

- **80,000+ experimentally validated phosphosites** meticulously mapped and structurally characterized, surpassing the coverage of any existing resource including PhosphoSitePlus (Hornbeck et al., 2015) and PhosphoPep (Bodenmiller et al., 2008)
- **Complete structural and sequence analysis of all human serine, threonine, and tyrosine (STY) sites** across the proteome, building upon but substantially extending databases like Phospho3D (Zanzoni et al., 2011)
- **6.4 billion+ sequence and structural comparisons** performed to identify novel relationships, representing an unprecedented computational undertaking in the field of phosphoproteomics
- **10,000 AlphaFold structures** locally hosted for optimized performance, leveraging the revolutionary advances in protein structure prediction (Jumper et al., 2021)
- **Millisecond-level query response** through Google Cloud SQL architecture, vastly outperforming the access speeds of traditional resources such as UniProt or the Protein Data Bank

## Project Overview

KinoPlex is the first comprehensive structural atlas of the entire human phosphoproteome, revolutionizing how researchers visualize, analyze, and understand protein phosphorylation by integrating three-dimensional structural data with extensive sequence information. The resource bridges the long-standing gap between sequence-based phosphosite analyses and structural biology, two disciplines that have historically developed in parallel with limited integration (Johnson & Lewis, 2001; Nishi et al., 2014).

## Scientific Significance

### Transforming Our Understanding of Phosphorylation Dynamics

Protein phosphorylation represents one of the most prevalent and dynamic post-translational modifications in eukaryotic cells, governing countless cellular processes from signal transduction to metabolic regulation, cell cycle control, and protein-protein interactions (Hunter, 2000; Ubersax & Ferrell, 2007). The significance of phosphorylation is underscored by the fact that approximately 30% of all human proteins are phosphorylated at some point during their functional lifecycle (Cohen, 2000), with dysregulation of phosphorylation implicated in numerous pathologies including cancer, neurodegeneration, and metabolic disorders (Lahiry et al., 2010; Blume-Jensen & Hunter, 2001).

Historically, phosphorylation research has been dominated by sequence-based approaches, with tools such as Scansite (Obenauer et al., 2003), GPS (Xue et al., 2008), and NetPhos (Blom et al., 1999) leveraging machine learning algorithms to identify sequence motifs associated with specific kinases. While these approaches have yielded valuable insights, they fundamentally fail to capture the three-dimensional context in which phosphorylation occurs.

KinoPlex addresses this limitation by integrating structural data for tens of thousands of phosphosites, revealing patterns and relationships that sequence-only analyses cannot detect. This integration matters because recent research shows that kinase specificity is shaped by structural elements beyond the immediate sequence vicinity (Duarte et al., 2014; Creixell et al., 2015).

### Beyond Current Methodological Approaches

While pioneering resources such as PhosphoSitePlus (Hornbeck et al., 2015) have catalogued thousands of experimentally validated phosphosites, and innovative platforms like Phos3D (Durek et al., 2009) and Phospho3D (Zanzoni et al., 2011) have attempted to incorporate structural elements, these resources have been limited by:

1. Incomplete coverage of the phosphoproteome (typically <30,000 sites)
2. Reliance on limited experimental structural data (<30% of human proteins)
3. Inability to effectively integrate sequence conservation with structural information
4. Computational limitations preventing comprehensive all-against-all structural comparisons

KinoPlex transcends these limitations through:

- **Comprehensive Integration of AlphaFold Structures**: Leveraging the revolutionary advances in protein structure prediction (Jumper et al., 2021; Varadi et al., 2022), KinoPlex incorporates AlphaFold models for the entire human proteome, enabling structural analysis of phosphosites regardless of experimental structure availability.

- **Novel Structural Comparison Algorithms**: Building upon methodologies pioneered by Creixell et al. (2015) and Xiao et al. (2020), KinoPlex implements advanced algorithms for phosphosite structural comparison that consider not only the immediate residue environment but also long-range structural elements that may influence kinase recognition.

- **Integration of Multi-omics Data**: Unlike existing resources that typically focus on either sequence or structure in isolation, KinoPlex seamlessly integrates phosphoproteomic, structural, evolutionary, and functional data, enabling researchers to explore the complex interrelationships between these different dimensions.

- **Application of Graph-Based Network Analysis**: Drawing inspiration from recent advances in network biology (Barabási et al., 2011; Bader & Hogue, 2003), KinoPlex implements sophisticated graph algorithms to identify clusters of structurally similar phosphosites that may share functional characteristics despite sequence divergence.

### Research Applications and Clinical Implications

The unprecedented coverage and integration of structural data in KinoPlex enable novel research approaches with significant implications for both basic science and clinical applications:

1. **Refined Kinase-Substrate Prediction**: By incorporating structural context, KinoPlex substantially improves the accuracy of kinase-substrate predictions compared to sequence-only methods (improving upon approaches described by Kobe et al., 2005 and Patrick et al., 2017), potentially accelerating the discovery of novel regulatory mechanisms.

2. **Drug Discovery Applications**: Understanding the structural context of phosphosites is critical for the design of phosphorylation-modulating therapeutics (Hopkins & Groom, 2002). KinoPlex provides structural insights that can guide the development of novel kinase inhibitors with enhanced specificity profiles.

3. **Evolutionary Analysis of Phosphosites**: The comprehensive structural mapping enables unprecedented evolutionary analyses of phosphorylation sites across species (extending work by Landry et al., 2009 and Studer et al., 2016), revealing conserved structural motifs that may not be apparent from sequence alone.

4. **Cancer Mutation Analysis**: Building upon seminal work in cancer phosphoproteomics (Reimand & Bader, 2013; Creixell et al., 2015), KinoPlex facilitates the systematic analysis of how cancer-associated mutations may disrupt phosphorylation networks through structural perturbations.

5. **Systems Biology Integration**: The structural atlas provides a foundation for integrating phosphorylation data into systems-level analyses of cellular signaling networks (Jørgensen & Linding, 2010; Terfve & Saez-Rodriguez, 2012), enabling more accurate modeling of complex cellular processes.

## Technical Infrastructure

KinoPlex combines cutting-edge web technologies with sophisticated computational biology to deliver an unparalleled resource for phosphoproteomics research:

- **Comprehensive Data Integration**: Seamlessly interfaces with PhosphoSitePlus (Hornbeck et al., 2015), UniProt (UniProt Consortium, 2021), and AlphaFold (Varadi et al., 2022) databases, aggregating and harmonizing data from these disparate sources to create a unified resource.

- **Cloud Architecture**: Deployed on Google Cloud with Cloud SQL backend, implementing sophisticated database optimization techniques inspired by advances in bioinformatics data management (Pavlidis & Noble, 2003; Oinn et al., 2004) to enable rapid querying of massive datasets.

- **Local Structure Repository**: 10,000 AlphaFold structures hosted for instant access, utilizing efficient structural data compression and retrieval methods based on recent advances in macromolecular structural informatics (Rose et al., 2016; Bakan et al., 2014).

- **Advanced Visualization**: Interactive 3D visualization of phosphosite environments leveraging state-of-the-art molecular visualization libraries and custom-developed tools for highlighting functionally relevant structural features (extending approaches pioneered by Rose & Hildebrand, 2015 and Rego & Koes, 2015).

- **Massive Computational Analysis**: 6.4 billion comparisons representing the largest computational analysis of phosphosite structures performed to date, employing distributed computing techniques and optimized algorithms for structural comparison based on methodologies described by Konc & Janežič (2017) and Gao & Skolnick (2013).

## Key Features

- **Interactive Structural Viewer**: Dynamically visualize protein structures with highlighted phosphosites, incorporating features for analyzing local structural environments inspired by approaches described in Tiberti et al. (2014) and Magnan et al. (2014).

- **Complete Phosphosite Atlas**: Navigate all 80,000+ human phosphosites with customizable filters, building upon and substantially extending the cataloguing approaches pioneered by resources such as PhosphoSitePlus (Hornbeck et al., 2015) and PHOSIDA (Gnad et al., 2011).

- **Structural Similarity Networks**: Discover relationships between phosphosites based on 3D environment, implementing novel network visualization techniques inspired by recent advances in biological network representation (Merico et al., 2009; Shannon et al., 2003).

- **Sequence Motif Analysis**: Identify conserved patterns with structural context, integrating methodologies from both sequence motif discovery (Bailey et al., 2009; Schwartz & Gygi, 2005) and structural motif identification (Nadzirin & Firdaus-Raih, 2012; Jonassen et al., 2000).

- **Kinase Prediction Engine**: Predict potential kinases using both sequence and structural information, building upon but significantly extending approaches described in recent literature (Wang et al., 2020; Chen et al., 2018; Wagih et al., 2016).

- **Comparative Analysis Tools**: Compare multiple phosphosites across different proteins, implementing sophisticated structural alignment techniques (Yang & Honig, 2000; Holm & Sander, 1995) optimized specifically for phosphosite comparisons.

- **Experimental Data Integration**: Correlate structural insights with published findings, drawing from methodologies for integrating heterogeneous biological data described by Lapatas et al. (2015) and Gomez-Cabrero et al. (2014).

- **Batch Query Processing**: Analyze multiple sites simultaneously for high-throughput research, implementing efficient parallel processing techniques inspired by advances in big data analytics for bioinformatics (O'Driscoll et al., 2013; Marx, 2013).


### Core Components

1. **Flask Web Application** (`web_app/app.py`): The main server that handles HTTP requests and serves the application
2. **Analysis Modules** (`protein_explorer/analysis/`): Core analytical functionality for proteins and phosphorylation sites
3. **Data Modules** (`protein_explorer/data/`): Handles data loading, caching, and retrieval from external sources
4. **Visualization Modules** (`protein_explorer/visualization/`): Generates visualizations for protein structures and networks
5. **Templates** (`web_app/templates/`): HTML templates for the user interface
6. **Static Files** (`web_app/static/`): JavaScript, CSS, and other static assets

### Data Flow

1. User submits a query (protein identifier or phosphosite)
2. Flask app routes the request to the appropriate handler
3. Data is retrieved from external sources (UniProt, AlphaFold) or local databases
4. Analysis modules process the data
5. Results are visualized through templates and client-side JavaScript
6. Interactive visualizations are rendered in the browser

# Protein Explorer

A comprehensive tool for exploring protein structures, analyzing phosphorylation sites, and predicting kinase interactions through structural and sequence similarities.

## Installation and Setup

### Prerequisites

* Python 3.8 or higher
* At least 4 GB RAM for loading large data files
* 16 GB disk space for caching and data storage
* Internet connection for retrieving protein data from external APIs

### Installation Steps

#### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/protein-explorer.git
cd protein-explorer
```

#### 2. Set Up a Virtual Environment

**On Linux/MacOS:**
```bash
python3 -m venv venv
source venv/bin/activate
```

**On Windows:**
```bash
python -m venv venv
venv\Scripts\activate
```

#### 3. Install Python Dependencies

```bash
# Update pip first
pip install --upgrade pip

# Install required packages
pip install -r requirements.txt

# Install the package in development mode
pip install -e .
```

#### 4. Download Required Data Files (Only If Not Planning To Run In Database Mode)

Download these essential data files and place them in the project root directory:

| File | Purpose |
|------|---------|
| `Combined_Kinome_10A_Master_Filtered_2.feather` | Structural similarity database |
| `Sequence_Similarity_Edges.parquet` | Sequence similarity database |
| `PhosphositeSuppData.feather` | Phosphosite supplementary data |
| `Structure_Kinase_Scores.feather` | Structure-based kinase scores |
| `Sequence_Kinase_Scores.feather` | Sequence-based kinase scores |
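Before starting the server in file-based mode, it can help to confirm that all five files are in place. The following is a minimal sketch, not part of the package: the helper name `missing_data_files` is hypothetical, and it assumes the files sit directly in the directory you pass in (the project root by default).

```python
from pathlib import Path

# File names copied from the table above.
REQUIRED_FILES = [
    "Combined_Kinome_10A_Master_Filtered_2.feather",
    "Sequence_Similarity_Edges.parquet",
    "PhosphositeSuppData.feather",
    "Structure_Kinase_Scores.feather",
    "Sequence_Kinase_Scores.feather",
]


def missing_data_files(root="."):
    """Return the required data files that are not present under `root`."""
    root_path = Path(root)
    return [name for name in REQUIRED_FILES if not (root_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All required data files found.")
```

File names are compared exactly, so a difference in letter case counts as missing on case-sensitive file systems.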

#### 5. Create Required Directories

```bash
# Create cache directory
mkdir -p ~/.protein_explorer/cache
```

#### 6. Start the Flask Development Server

```bash
# Run the Flask app
python web_app/app.py
```

This starts the development server at http://127.0.0.1:5000/.

#### 7. Verify Installation

Open your browser and navigate to http://127.0.0.1:5000/

### Database Integration (Optional)

For improved performance with large datasets:

1. Install database connector:
   ```bash
   pip install pymysql sqlalchemy
   ```

2. Configure database connection in `.env`:
   ```
   USE_DATABASE=True
   DB_HOST=your-database-host
   DB_PORT=3306
   DB_USER=your-username
   DB_PASS=your-password
   DB_NAME=kinoplex-db
   ```
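As an illustration of how these settings could be turned into a connection string, here is a minimal sketch. It assumes the variables have already been loaded into the process environment and that the connector uses the `mysql+pymysql` SQLAlchemy URL scheme; the function name `database_url` is hypothetical, and the package's own connector may build its connection differently.

```python
import os
from urllib.parse import quote_plus


def database_url(env=None):
    """Build a SQLAlchemy-style MySQL URL from the settings above.

    Returns None when USE_DATABASE is not set to "True", which this sketch
    treats as file-based mode (an assumption, not documented behavior).
    """
    env = os.environ if env is None else env
    if env.get("USE_DATABASE", "False").strip().lower() != "true":
        return None
    # Percent-encode credentials so characters such as "@" do not break the URL.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASS"])
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "3306")
    name = env["DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{name}"
```

The resulting string can be passed to `sqlalchemy.create_engine` once `pymysql` and `sqlalchemy` are installed.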

## Troubleshooting

### Missing Data Files

If you see `FileNotFoundError: Structural similarity data file not found`:
- Verify all required data files are in the project root directory
- Check that file names match exactly (case-sensitive)

### Memory Errors

If you encounter `MemoryError` when loading large data files:
- Ensure your system has sufficient available RAM
- Consider enabling swap space on limited-memory systems
- Try using database mode for large datasets

### ImportError or ModuleNotFoundError

If modules can't be found:
- Ensure your virtual environment is activated
- Verify all dependencies are installed: `pip list`
- Reinstall the package: `pip install -e .`

### Browser Visualization Issues

If visualizations don't appear:
- Check browser console for JavaScript errors
- Ensure your browser supports modern JavaScript features
- Try a different browser (Chrome or Firefox recommended)

## Updating the Application

```bash
# Pull the latest changes
git pull

# Update dependencies
pip install -r requirements.txt

# Clear the cache
rm -rf ~/.protein_explorer/cache
mkdir -p ~/.protein_explorer/cache
```

## Development Setup

For contributing to the project:

1. Install development dependencies:
   ```bash
   pip install -r requirements-dev.txt
   ```

2. Set up pre-commit hooks:
   ```bash
   pre-commit install
   ```

3. Run tests:
   ```bash
   pytest
   ```


## Core Modules

### Analysis Modules

- **enhanced_table.py**: Generates enhanced HTML tables for phosphosite visualization
- **kinase_predictor.py**: Predicts kinases for phosphorylation sites based on similarity
- **networks.py**: Analyzes protein interaction networks using linear algebra
- **phospho.py**: Analyzes potential phosphorylation sites in proteins
- **phospho_analyzer.py**: Handles phosphosite structural analysis and comparisons
- **sequence_analyzer.py**: Analyzes sequence similarity between phosphorylation sites
- **structure.py**: Functions for analyzing protein structures using linear algebra

### Web Application (app.py)

The main application file (`web_app/app.py`) contains:

- Flask route definitions for all pages
- API endpoints for data retrieval
- Data preprocessing and integration logic
- Template rendering with appropriate context data

### Key Routes

- `/`: Home page
- `/search`: Protein search page
- `/protein/<identifier>`: Detailed protein information page
- `/site/<uniprot_id>/<site>`: Detailed phosphorylation site page
- `/phosphosite`: Phosphosite structural analysis page
- `/site-search`: Search page for specific phosphorylation sites
- `/analyze`: Tool for analyzing multiple proteins
- `/faq`: Frequently asked questions

### API Endpoints

- `/api/phosphosites/<uniprot_id>`: Get phosphorylation site data for a protein
- `/api/structure/<uniprot_id>`: Get structure information for a protein
- `/api/network/<uniprot_id>`: Get protein interaction network data
- `/api/sequence_matches/<site_id>`: Get sequence similarity matches for a phosphosite
- `/api/sequence_conservation/<site_id>`: Get sequence conservation analysis for a phosphosite
- `/api/kinases/<site_id>`: Get kinase prediction scores for a phosphosite
- `/api/kinases/compare`: Compare kinase predictions across multiple sites
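
The endpoints can be called from any HTTP client. The sketch below uses only the standard library and makes several assumptions: the development server is running at the address from the setup steps, responses are JSON, and the helper names `api_url` and `fetch_json` are hypothetical. The format of `<site_id>` is not specified here, so only a UniProt accession is shown.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000/"  # development server address from the setup steps


def api_url(resource, identifier, base_url=BASE_URL):
    """Return the URL of a per-protein or per-site endpoint, e.g. 'phosphosites'."""
    return urljoin(base_url, f"api/{resource}/{quote(identifier, safe='')}")


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumes the endpoint returns JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # P04637 is the UniProt accession for human p53, used here only as an example.
    print(api_url("phosphosites", "P04637"))
```

With the server running, `fetch_json(api_url("structure", "P04637"))` would request the structure endpoint for that protein.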

## Frontend Components

### Templates

The application uses a series of HTML templates located in `web_app/templates/`:

- **index.html**: Home page with search functionality and feature overview
- **protein.html**: Detailed protein information page with structure visualization
- **site.html**: Detailed phosphorylation site analysis page
- **site_structural_section.html**: Tab-based structural analysis section
- **site_sequence_section.html**: Tab-based sequence analysis section
- **combined_kinase_tab.html**: Combined kinase analysis tab content
- **structural_network_script.html**: JavaScript for structural network visualization
- **sequence_network_script.html**: JavaScript for sequence network visualization

### JavaScript Components

- **kinase_prediction.js**: Visualizes kinase prediction results with charts
- **phosphosite-visualization.js**: Handles visualization of phosphosites in protein context

### Visualization Techniques

- **3D Protein Structures**: Uses NGL Viewer for 3D molecular visualization
- **Network Graphs**: Uses D3.js for interactive network visualizations
- **Charts and Plots**: Uses Chart.js for kinase prediction visualizations
- **Heatmaps**: D3.js-based heatmaps for multi-site comparisons
- **Interactive Tables**: Enhanced tables with filtering and sorting capabilities

### MySQL Integration (Remote SQL Database)

Protein Explorer uses a remote MySQL database hosted on Google Cloud SQL to manage and query structured data in real time. This database replaces earlier local file-based storage (e.g., .feather, .parquet) for certain datasets, enabling:
- Faster querying and analysis from the web app
- Centralized storage across different deployments
- Scalable infrastructure for large datasets

#### How it works
All SQL interactions are managed via a secure SQLAlchemy-based connector (`cloud_sql_connector.py`). The app keeps recent query results in in-memory caches (Python dictionaries) for the duration of a session. While the remote database is reachable, the app does not rely on local disk caching; it uses only remote SQL and memory. If a connection to the remote database cannot be established, the application falls back to the local disk caching system.

### Data Loading and Caching

The application implements efficient data loading and caching strategies:

1. **Remote SQL Queries**: Primary source of all structured data
2. **In-Memory Caching**: Frequently accessed query results are cached in memory
3. **Progressive Loading**: Large datasets are loaded incrementally and on-demand

If remote database access fails, the application falls back to its local loading and caching strategies:

1. **Preloading**: Critical data is preloaded at application startup
2. **Local Caching**: Downloaded structures and data are cached locally
3. **In-Memory Caching**: Frequently accessed data is cached in memory
4. **Progressive Loading**: Data is loaded progressively as needed
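
The lookup order described above (memory first, then the remote database, then local data on failure) can be sketched as follows. This is an illustrative pattern only: the class name `CachedLoader`, the two loader callables, and the use of `ConnectionError` to signal an unreachable database are all assumptions rather than the package's actual implementation.

```python
class CachedLoader:
    """Serve lookups from memory, then a primary source, then a fallback source."""

    def __init__(self, primary, fallback):
        self._primary = primary    # e.g. a function that queries the remote database
        self._fallback = fallback  # e.g. a function that reads local files
        self._cache = {}           # in-memory cache keyed by query key

    def get(self, key):
        # 1. Return a cached result if this key was already requested.
        if key in self._cache:
            return self._cache[key]
        # 2. Try the primary source; 3. fall back if it is unreachable.
        try:
            value = self._primary(key)
        except ConnectionError:
            value = self._fallback(key)
        self._cache[key] = value
        return value
```

Because results are stored after the first lookup, repeated requests for the same key never reach either data source again during the session.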

## Key Analysis Features

### Phosphosite Identification

Identifies potential phosphorylation sites (S, T, Y residues) in protein sequences and analyzes their structural contexts. For each phosphosite, the application calculates:

- Mean pLDDT score (confidence metric from AlphaFold)
- Number of nearby residues within 10 Å
- Surface accessibility
- Sequence motif (-7 to +7 positions)
- Comparison with known phosphorylation sites
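
To make the motif definition concrete, the sketch below scans a sequence for S, T, and Y residues and extracts the -7 to +7 window around each one. The function name `find_sty_sites` and the use of `-` to pad windows that run past either end of the sequence are assumptions made for this example; the package may handle sequence ends differently.

```python
def find_sty_sites(sequence, flank=7, pad="-"):
    """Return (site label, motif) pairs for every S, T, or Y in a sequence.

    Site labels use 1-based residue numbering (e.g. "S15"). Each motif has
    2 * flank + 1 characters with the candidate residue in the centre.
    """
    # Pad both ends so that windows near the termini keep a fixed length.
    padded = pad * flank + sequence + pad * flank
    sites = []
    for index, residue in enumerate(sequence):
        if residue in "STY":
            motif = padded[index:index + 2 * flank + 1]
            sites.append((f"{residue}{index + 1}", motif))
    return sites
```

With the default `flank=7`, every motif is 15 characters long and the phosphorylatable residue is always at index 7.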

### Structural Similarity Analysis

Compares the 3D structure of phosphorylation sites to identify similar binding patterns:

- RMSD-based structural similarity calculations
- Interactive network visualization of similar sites
- Filtering by RMSD threshold
- Detailed comparison of structural features
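
For reference, RMSD between two sets of atom coordinates is the square root of the mean squared distance between paired atoms. The sketch below assumes the two sites have already been superimposed and contain the same atoms in the same order; it performs no alignment itself, and the function names are hypothetical rather than taken from the package.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def filter_by_rmsd(matches, threshold):
    """Keep (site, rmsd) pairs at or below `threshold`, most similar first."""
    kept = [(site, value) for site, value in matches.items() if value <= threshold]
    return sorted(kept, key=lambda pair: pair[1])
```

Lower RMSD means greater structural similarity, so tightening the threshold leaves fewer, closer matches.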

### Sequence Similarity Analysis

Analyzes sequence patterns around phosphorylation sites:

- Sequence similarity scoring
- Motif conservation analysis
- Position-specific amino acid distributions
- N-terminal and C-terminal region analysis
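
Two of these analyses are easy to illustrate on equal-length motifs. The sketch below uses plain fractional identity as the similarity score and simple per-position frequencies for the amino acid distributions; both are stand-ins chosen for clarity, since the scoring scheme actually used by the application is not described here, and the function names are hypothetical.

```python
from collections import Counter


def motif_identity(motif_a, motif_b):
    """Fraction of positions at which two equal-length motifs share the same residue."""
    if len(motif_a) != len(motif_b) or not motif_a:
        raise ValueError("motifs must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(motif_a, motif_b))
    return matches / len(motif_a)


def position_frequencies(motifs):
    """Return one {residue: frequency} dict per motif position."""
    if not motifs:
        return []
    table = []
    for position in range(len(motifs[0])):
        counts = Counter(motif[position] for motif in motifs)
        table.append({aa: n / len(motifs) for aa, n in counts.items()})
    return table
```

Splitting each motif at its central residue before calling these functions gives separate N-terminal and C-terminal comparisons.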

### Kinase Prediction

Predicts potential kinases for phosphorylation sites based on:

1. **Structure-based prediction**: Using structural similarity to known kinase substrates
2. **Sequence-based prediction**: Using sequence patterns recognized by specific kinases
3. **Combined analysis**: Integration of both approaches for more robust predictions
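
One simple way to integrate the two score sets is a weighted average per kinase. The sketch below is an assumption for illustration only: the function name `combine_scores`, the default equal weighting, and the treatment of a missing score as 0.0 are not taken from the package, which may combine predictions differently. The kinase names and scores in the usage example are invented.

```python
def combine_scores(structure_scores, sequence_scores, structure_weight=0.5):
    """Rank kinases by a weighted average of structure- and sequence-based scores.

    A kinase missing from one score set contributes 0.0 for that set.
    """
    if not 0.0 <= structure_weight <= 1.0:
        raise ValueError("structure_weight must be between 0 and 1")
    kinases = set(structure_scores) | set(sequence_scores)
    combined = {
        kinase: structure_weight * structure_scores.get(kinase, 0.0)
        + (1.0 - structure_weight) * sequence_scores.get(kinase, 0.0)
        for kinase in kinases
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    structure = {"CDK1": 0.8, "AKT1": 0.2}   # made-up example scores
    sequence = {"CDK1": 0.4, "MAPK1": 0.6}   # made-up example scores
    for kinase, score in combine_scores(structure, sequence):
        print(f"{kinase}: {score:.2f}")
```

Raising `structure_weight` toward 1.0 makes the ranking follow the structure-based scores more closely.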

## Contributing

Contributions to Protein Explorer are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

[MIT License](LICENSE)

## Acknowledgements

- AlphaFold team for providing protein structure predictions
- UniProt for comprehensive protein data
- NGL Viewer for molecular visualization
- D3.js and Chart.js for data visualization
- Bootstrap for UI components



