Metadata-Version: 2.1
Name: oligoseeker
Version: 0.0.6
Summary: A Python tool for processing paired FASTQ files to efficiently count oligo codons.
Home-page: https://github.com/mtinti/OligoSeeker
Author: mtinti
Author-email: michele.tinti@gmail.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: biopython
Requires-Dist: fastcore
Requires-Dist: pandas
Requires-Dist: tqdm
Provides-Extra: dev

# OligoSeeker


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

### A Python library for processing FASTQ files to count oligo codons

> [![DOI](https://zenodo.org/badge/946567115.svg)](https://doi.org/10.5281/zenodo.15011916)

### Reference

> https://mtinti.github.io/OligoSeeker/

## Installation

You can install the package via pip:

``` bash
pip install oligoseeker
```

Or directly from the repository:

``` bash
pip install git+https://github.com/mtinti/OligoSeeker.git
```

## Overview

OligoSeeker is a Python library designed to process paired FASTQ files
and count occurrences of specific oligo codons. It provides a simple yet
powerful interface for bioinformatics researchers working with
oligonucleotide analysis.

## Features

- Process paired FASTQ files (gzipped or uncompressed)
- Search for custom oligo sequences with codon sites (NNN)
- Support for both forward and reverse complement matching
- Comprehensive results in CSV format
- Merge functionality to combine results from multiple samples
- User-friendly command-line interface with multiple modes
- Modular design for integration with other tools

## Scientific Background: Oligonucleotide-Targeted Mutagenesis

Oligonucleotide-targeted mutagenesis is a powerful technique in
molecular biology that enables precise alterations of DNA sequences. In
this approach, synthetic oligonucleotides (short DNA fragments,
typically 20-60 nucleotides) are designed to target specific locations
in a gene, allowing researchers to introduce defined mutations.

### The Structure of Mutagenic Oligos

A typical mutagenic oligo has three distinct components:

1.  **5’ Homology Arm**: A sequence that matches the target DNA upstream
    of the mutation site, providing specificity.
2.  **Mutation Site (NNN)**: The actual mutation being introduced, often
    represented as “NNN” when a mixture of all possible codons is used.
3.  **3’ Homology Arm**: A sequence that matches the target DNA
    downstream of the mutation site, providing additional specificity.

For example, if our target DNA sequence is:

    5'-ATGCATGCATGCATGCATGCATGCATGCATGC-3'

And we want to mutagenize the codon marked with underscores:

    5'-ATGCATGCATGCAT___TGCATGCATGCATGC-3'

We would design an oligo like:

    5'-ATGCATGCATGCATNNNTGCATGCATGCATGC-3'
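
The three components described above can be separated programmatically. The following is a minimal sketch, not part of the OligoSeeker API: the helper name `split_oligo` is hypothetical, and it assumes the oligo contains exactly one `NNN` site.

```python
from typing import Tuple


def split_oligo(oligo: str, site: str = "NNN") -> Tuple[str, str, str]:
    """Split an oligo into (5' homology arm, mutation site, 3' homology arm).

    Hypothetical helper for illustration; assumes exactly one mutation site.
    """
    oligo = oligo.upper()
    if oligo.count(site) != 1:
        raise ValueError(f"expected exactly one {site} site in {oligo!r}")
    five_prime, three_prime = oligo.split(site)
    return five_prime, site, three_prime


# One of the example oligos used later in this README
arms = split_oligo("GCGGATTACATTNNNAAATAACATCGT")
print(arms)  # ('GCGGATTACATT', 'NNN', 'AAATAACATCGT')
```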

### Why Use NNN Codons?

The “NNN” in the oligo represents a mixture of all possible nucleotide
combinations at that position:

- N = A mixture of A, T, G, and C
- NNN = All 64 possible codons (4³ = 64)

This approach allows:

- **Saturation mutagenesis**: Testing all possible amino acid
  substitutions at a position
- **Structure-function studies**: Identifying critical residues in
  proteins
- **Protein engineering**: Optimizing enzyme activity or stability
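
To make the 4³ = 64 arithmetic concrete, the full codon set can be enumerated with the Python standard library. This is an illustrative snippet only and does not use any OligoSeeker function.

```python
from itertools import product

# Every ordered combination of three bases drawn from A, C, G, T
codons = ["".join(bases) for bases in product("ACGT", repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']
```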

### Deep Sequencing of Mutagenesis Libraries

After the mutagenesis reaction, the resulting DNA library contains a
mixture of variants with different codons at the target position.
Next-generation sequencing technologies allow researchers to sequence
thousands or millions of these variants simultaneously.

`OligoSeeker` helps analyze this sequencing data by:

1. Identifying reads that contain the mutagenic oligo
2. Extracting the specific codon present at the NNN position
3. Counting the frequency of each codon variant

This information is crucial for:

- Verifying library coverage (were all possible codons incorporated?)
- Quantifying biases in the mutagenesis process
- Analyzing selection experiments where certain variants may be
  enriched
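
As a rough illustration of the coverage and bias checks mentioned above, the sketch below summarizes codon counts for a single oligo. The function `summarize_counts` and the counts are made up for this example; the only assumption taken from this README is that unmatched reads are tallied under a `none` label, which is excluded here.

```python
from itertools import product

ALL_CODONS = {"".join(bases) for bases in product("ACGT", repeat=3)}


def summarize_counts(counts):
    """Return (coverage fraction, missing codons, per-codon frequencies).

    Hypothetical helper; labels that are not codons (e.g. 'none') are ignored.
    """
    observed = {c: n for c, n in counts.items() if c in ALL_CODONS and n > 0}
    total = sum(observed.values())
    coverage = len(observed) / len(ALL_CODONS)
    missing = sorted(ALL_CODONS - set(observed))
    freqs = {c: n / total for c, n in observed.items()} if total else {}
    return coverage, missing, freqs


# Made-up counts for one oligo
counts = {"none": 1980, "ACT": 20, "GGC": 40, "AAA": 60}
coverage, missing, freqs = summarize_counts(counts)

print(f"coverage: {coverage:.3f}")  # coverage: 0.047
print(len(missing))                 # 61
print(freqs["AAA"])                 # 0.5
```

A low coverage value points to codons that never made it into the library, while strongly uneven frequencies suggest a bias in the mutagenesis or selection step.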

## How It Works

OligoSeeker searches for specific oligonucleotide patterns in paired
FASTQ reads. When it finds a match, it extracts the codon sequence
(represented by NNN in the oligo pattern) and tallies its occurrence.
The library handles both forward and reverse complement matching,
ensuring comprehensive detection.
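
The matching idea can be sketched with a regular expression in which the `NNN` site becomes a capture group. This is an assumption-laden illustration, not OligoSeeker's actual implementation: the helper names (`oligo_to_regex`, `find_codon`) are hypothetical, only exact matches of the flanking arms are considered, and the reverse strand is handled by reverse-complementing the read.

```python
import re
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def oligo_to_regex(oligo: str):
    """Compile an oligo with one NNN site into a regex capturing the codon."""
    five_prime, three_prime = oligo.upper().split("NNN")
    return re.compile(re.escape(five_prime) + "([ACGT]{3})" + re.escape(three_prime))


def find_codon(read: str, pattern) -> Optional[str]:
    """Search the read, then its reverse complement; return the codon or None."""
    for candidate in (read, reverse_complement(read)):
        match = pattern.search(candidate)
        if match:
            return match.group(1)
    return None


pattern = oligo_to_regex("GCGGATTACATTNNNAAATAACATCGT")
forward_read = "TTGCGGATTACATTACTAAATAACATCGTAA"

print(find_codon(forward_read, pattern))                      # ACT
print(find_codon(reverse_complement(forward_read), pattern))  # ACT
print(find_codon("ACGTACGTACGT", pattern))                    # None
```

Because the read is reverse-complemented rather than the pattern, the codon is always reported in the orientation of the oligo.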

The basic count workflow is:

1. Load and validate oligo sequences
2. Process paired FASTQ files
3. Count codon occurrences for each oligo
4. Output results in CSV format

Additionally, the merge workflow allows you to:

1. Process multiple samples independently
2. Combine the count results from different runs
3. Sum the codon occurrences across samples
4. Analyze patterns across a larger dataset
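
The summing step of the merge workflow can be pictured as adding per-oligo codon tallies from independent runs. The sketch below uses `collections.Counter` from the standard library; it is not how `merge_count_csvs` is implemented, and the `merge_counts` helper, oligo names, and numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

CountTable = Dict[str, Counter]  # oligo name -> codon counts


def merge_counts(tables: Iterable[CountTable]) -> CountTable:
    """Sum codon counts per oligo across several runs (hypothetical helper)."""
    merged: CountTable = {}
    for table in tables:
        for oligo, counts in table.items():
            merged.setdefault(oligo, Counter()).update(counts)
    return merged


# Made-up results from two independent runs
run_a = {"oligo_1": Counter({"ACT": 20, "none": 1980})}
run_b = {"oligo_1": Counter({"ACT": 20, "GGC": 1, "none": 1979})}

merged = merge_counts([run_a, run_b])
print(merged["oligo_1"]["ACT"])   # 40
print(merged["oligo_1"]["GGC"])   # 1
print(merged["oligo_1"]["none"])  # 3959
```

Codons missing from one run are treated as zero, so runs do not need to have observed the same set of codons.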

## Performance and Compatibility

OligoSeeker has been tested on both Linux and macOS platforms.

- **Test Case**: 1 oligo (33 bp) analyzed in 150 bp paired-end FASTQ
  files containing 300 million reads
- **Processing Time**:
  - ~1 hour on a high-performance compute cluster
  - ~1.5 hours on a standard MacBook Pro

### Scalability

For large datasets, we’ve implemented an efficient workflow to
significantly increase throughput:

1.  **File Splitting**: Large FASTQ files are split into smaller chunks
    using [seqkit](https://bioinf.shenwei.me/seqkit/), a
    high-performance toolkit for FASTA/Q file manipulation
2.  **Parallel Processing**: OligoSeeker is applied in parallel to each
    chunk independently
3.  **Result Merging**: Individual results are merged using
    OligoSeeker’s built-in merge functionality
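
The split, process, and merge pattern above can be illustrated in miniature. This toy sketch is only an analogy: the chunks are in-memory lists rather than FASTQ files produced by seqkit, `count_chunk` is a hypothetical stand-in for one OligoSeeker run, and threads are used for simplicity where a real workload would more likely use separate processes or cluster jobs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_chunk(reads: List[str]) -> Counter:
    """Toy stand-in for one counting run on one chunk (hypothetical logic).

    Tallies the 3 bases following a fixed marker, or 'none' if absent.
    """
    marker = "GATTACA"
    counts: Counter = Counter()
    for read in reads:
        pos = read.find(marker)
        start = pos + len(marker)
        codon = read[start:start + 3] if pos >= 0 else ""
        counts[codon if len(codon) == 3 else "none"] += 1
    return counts


# Step 1: the "file" is already split into chunks
chunks = [
    ["AAGATTACAACTGG", "CCCCCCCC"],
    ["GATTACAGGCTT", "TTGATTACAACTAA"],
]

# Step 2: process each chunk independently
with ThreadPoolExecutor(max_workers=2) as pool:
    per_chunk = list(pool.map(count_chunk, chunks))

# Step 3: merge the per-chunk results by summing
total = sum(per_chunk, Counter())
print(total["ACT"], total["GGC"], total["none"])  # 2 1 1
```

Since each chunk is counted independently and the results are simply summed, the final totals do not depend on how the input was split.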

## Quick Start

### Command-Line Usage

``` bash
# Basic usage with oligos
oligoseeker -m count \
--f1 ../test_files/test_1.fq.gz \
--f2 ../test_files/test_2.fq.gz \
--oligos "GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT" \
--output ../test_files/test_outs --prefix test_cm3

# Basic usage with oligos files
oligoseeker -m count \
--f1 ../test_files/test_1.fq.gz \
--f2 ../test_files/test_2.fq.gz \
--oligos-file '../test_files/oligos.txt' \
--output ../test_files/test_outs --prefix test_cm4

# Basic usage to merge oligo counts
oligoseeker -m merge \
--output-file 'merge_cl.csv' \
--input-dir ../test_files/test_outs \
--output ../test_files/merged 
```

### Python API Usage

Here’s a simple example of using the Python API:

``` python
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline
# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_list=["GCGGATTACATTNNNAAATAACATCGT", "TGTGGTAAGCGGNNNGAAAGCATTTGT", "GTCGTAGAAAATNNNTGGGTGATGAGC"],
    output_path="../test_files/test_outs",
    output_prefix='test1'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")
```

    2025-03-12 15:10:00,869 - INFO - Starting OligoCodonPipeline
    2025-03-12 15:10:00,869 - INFO - Loading oligo sequences...
    2025-03-12 15:10:00,870 - INFO - Using provided oligo list
    2025-03-12 15:10:00,870 - INFO - Loaded 3 oligo sequences
    2025-03-12 15:10:00,871 - INFO - Processing FASTQ files...

    0it [00:00, ?it/s]

    2025-03-12 15:10:00,974 - INFO - Formatting results...
    2025-03-12 15:10:00,976 - INFO - Saving results to: ../test_files/test_outs/test1_counts.csv
    2025-03-12 15:10:01,000 - INFO - Pipeline completed in 0.13 seconds

    Results saved to: ../test_files/test_outs/test1_counts.csv

``` python
# This should show 20 (ACT), 40 (GGC), and 60 (AAA) matches for
# oligos 1, 2, and 3, respectively
import pandas as pd
out = pd.read_csv(results['csv_path'],index_col=[0])
out.head()
```

|  | 1_GCGGATTACATTNNNAAATAACATCGT | 2_TGTGGTAAGCGGNNNGAAAGCATTTGT | 3_GTCGTAGAAAATNNNTGGGTGATGAGC |
|----|----|----|----|
| none | 1980.0 | 1960.0 | 1940.0 |
| ACT | 20.0 | 0.0 | 0.0 |
| GGC | 0.0 | 40.0 | 0.0 |
| AAA | 0.0 | 0.0 | 60.0 |

Here’s a simple example of using the Python API with oligos listed in a
file:

``` python
from OligoSeeker.pipeline import PipelineConfig, OligoCodonPipeline
# Create a configuration
config = PipelineConfig(
    fastq_1="../test_files/test_1.fq.gz",
    fastq_2="../test_files/test_2.fq.gz",
    oligos_file="../test_files/oligos.txt",
    output_path="../test_files/test_outs",
    output_prefix='test2'
)

# Create and run the pipeline
pipeline = OligoCodonPipeline(config)
results = pipeline.run()

# Print the locations of output files
print(f"Results saved to: {results['csv_path']}")
```

    2025-03-12 15:10:01,100 - INFO - Starting OligoCodonPipeline
    2025-03-12 15:10:01,101 - INFO - Loading oligo sequences...
    2025-03-12 15:10:01,101 - INFO - Loading oligos from file: ../test_files/oligos.txt
    2025-03-12 15:10:01,103 - INFO - Loaded 3 oligo sequences
    2025-03-12 15:10:01,103 - INFO - Processing FASTQ files...

    0it [00:00, ?it/s]

    2025-03-12 15:10:01,154 - INFO - Formatting results...
    2025-03-12 15:10:01,156 - INFO - Saving results to: ../test_files/test_outs/test2_counts.csv
    2025-03-12 15:10:01,160 - INFO - Pipeline completed in 0.06 seconds

    Results saved to: ../test_files/test_outs/test2_counts.csv

### Merging Count Files

You can merge multiple count files from different runs to combine
results:

``` python
from OligoSeeker.merge import merge_count_csvs

# Merge all count files in a directory
merged_df = merge_count_csvs(
    input_dir="../test_files/test_outs",  # Directory containing count files
    output_file="merged_counts.csv",      # Output filename
    output_dir="../test_files/merged",    # Output directory
    pattern="*_counts.csv"                # Pattern to match files
)

print(f"Merged {len(merged_df)} codons across {len(merged_df.columns)} oligos")
merged_df.head()
```

    Found 4 CSV files to merge
      Loaded ../test_files/test_outs/test2_counts.csv with 4 rows and 3 columns
      Loaded ../test_files/test_outs/test1_counts.csv with 4 rows and 3 columns
      Loaded ../test_files/test_outs/test_cm3_counts.csv with 4 rows and 3 columns
      Loaded ../test_files/test_outs/test_cm4_counts.csv with 4 rows and 3 columns
    Merged data saved to ../test_files/merged/merged_counts.csv
    Merged 4 codons across 3 oligos

|  | 1_GCGGATTACATTNNNAAATAACATCGT | 2_TGTGGTAAGCGGNNNGAAAGCATTTGT | 3_GTCGTAGAAAATNNNTGGGTGATGAGC |
|----|----|----|----|
| AAA | 0.0 | 0.0 | 240.0 |
| ACT | 80.0 | 0.0 | 0.0 |
| GGC | 0.0 | 160.0 | 0.0 |
| none | 7920.0 | 7840.0 | 7760.0 |

## Modules

OligoSeeker is organized into several modules:

### Core

The [core module](./core.html) contains fundamental utilities and
classes:

- DNA sequence operations (reverse complement, etc.)
- OligoRegex for pattern matching
- OligoLoader for loading and validating oligo sequences

### FASTQ Processing

The [FASTQ module](./fastq.html) handles reading and processing FASTQ
files:

- FastqHandler for file operations
- OligoCodonProcessor for counting codons in FASTQ files

### Output

The [output module](./output.html) manages results formatting and
saving:

- ResultsFormatter for converting results to DataFrames
- ResultsSaver for saving to various file formats

### Pipeline

The [pipeline module](./pipeline.html) provides the complete processing
pipeline:

- PipelineConfig for configuration settings
- ProgressReporter for progress tracking
- OligoCodonPipeline for end-to-end processing

### Merge

The [merge module](./merge.html) provides functionality to combine
multiple count results:

- Merge count CSV files by summing values
- Support for flexible output naming and location
- Pattern matching to select specific files

### CLI

The [CLI module](./cli.html) implements the command-line interface:

- Argument parsing
- Configuration validation
- Pipeline execution

## Quick Start

### Command-Line Usage

For count mode (processing FASTQ files):

``` bash
# Using oligos directly specified
oligoseeker -m count --f1 test_files/test_1.fq.gz --f2 test_files/test_2.fq.gz \
--oligos "GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT" \
--output test_outs --prefix test_run1

# Using oligos from a file
oligoseeker -m count --f1 test_files/test_1.fq.gz --f2 test_files/test_2.fq.gz \
--oligos-file test_files/oligos.txt --output test_outs --prefix test_run2
```

For merge mode (combining multiple count files):

``` bash
# Merge all count files in a directory
oligoseeker -m merge --input-dir test_outs --output test_outs/merged \
--output-file combined_counts.csv
```

## CLI Reference

``` bash
usage: oligoseeker [-h] [-m {count,merge}] [--f1 FASTQ_PATH_1] [--f2 FASTQ_PATH_2]
                  [--oligos-file OLIGOS_FILE] [--oligos OLIGOS_STRING]
                  [--offset OFFSET_OLIGO] [--input-dir INPUT_DIR]
                  [--output-file OUTPUT_FILE] [--pattern PATTERN]
                  [-o OUTPUT_PATH] [--prefix OUTPUT_PREFIX]
                  [--log-file LOG_FILE]
                  [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

OligoSeeker: Process FASTQ files to count oligo codons

options:
  -h, --help            show this help message and exit
  -m {count,merge}, --mode {count,merge}
                        Operation mode: 'count' to process FASTQ files or 'merge' to combine CSV counts (default: count)
  -o OUTPUT_PATH, --output OUTPUT_PATH
                        Output directory for results (default: ../test_files/test_outs)
  --prefix OUTPUT_PREFIX
                        Prefix for output files (default: )
  --log-file LOG_FILE   Path to log file (if not specified, logs to console only)
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level (default: INFO)

Count Mode Options:
  --f1 FASTQ_PATH_1, --fastq_1 FASTQ_PATH_1
                        Path to FASTQ 1 file (default: ../test_fastq_files/test_1.fq.gz)
  --f2 FASTQ_PATH_2, --fastq_2 FASTQ_PATH_2
                        Path to FASTQ 2 file (default: ../test_fastq_files/test_2.fq.gz)

Oligo Source Options:
  --oligos-file OLIGOS_FILE
                        File containing oligo sequences (one per line)
  --oligos OLIGOS_STRING
                        Comma-separated list of oligo sequences
                        (default: GCGGATTACATTNNNAAATAACATCGT,TGTGGTAAGCGGNNNGAAAGCATTTGT,GTCGTAGAAAATNNNTGGGTGATGAGC)
  --offset OFFSET_OLIGO
                        Value to add to oligo index in output (default: 1)

Merge Mode Options:
  --input-dir INPUT_DIR
                        Directory containing CSV files to merge (required for merge mode)
  --output-file OUTPUT_FILE
                        Name of the output merged file (default: merged_counts.csv)
  --pattern PATTERN     Pattern to match CSV files (default: *count*.csv)
```

## Data Requirements

OligoSeeker works with standard paired FASTQ files, which should be
named according to common conventions:

- Read 1: `*_1.fq.gz`, `*_R1.fastq.gz`, or `*_R1_001.fastq.gz`
- Read 2: `*_2.fq.gz`, `*_R2.fastq.gz`, or `*_R2_001.fastq.gz`

The oligo sequences should include a codon site marked with `NNN`. For
example:

    GAACNNNCAT
    TGACNNNTAG

This specifies that the 3 bases following `GAAC` or `TGAC` should be
captured as the codon.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Setup

1.  Clone the repository

2.  Install development dependencies:

    ``` bash
    pip install -e ".[dev]"
    pip install nbdev
    ```

3.  Make changes to the notebook files in the `nbs` directory

4.  Build the library:

    ``` bash
    nbdev_build_lib
    ```

5.  Build the documentation:

    ``` bash
    nbdev_build_docs
    ```

## License

This project is licensed under the Apache 2.0 License - see the LICENSE
file for details.
