Metadata-Version: 2.1
Name: usum
Version: 0.1.3
Summary: USUM: Plotting sequence similarity using USEARCH & UMAP
Home-page: https://github.com/prihoda/usum
Author: David Příhoda
Author-email: david.prihoda@gmail.com
License: MIT
Keywords: dna,protein,sequence,similarity,umap,usearch,uclust,plot
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: biopython
Requires-Dist: umap-learn[plot]

# USUM: Plotting sequence similarity using USEARCH & UMAP

USUM uses [USEARCH](https://drive5.com/usearch/) and [UMAP](https://github.com/lmcinnes/umap) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.

[![PyPI - Downloads](https://img.shields.io/pypi/dm/usum.svg?color=green&label=PyPI%20downloads)](https://pypi.python.org/pypi/usum/)
[![PyPI license](https://img.shields.io/pypi/l/usum.svg)](https://pypi.python.org/pypi/usum/)
[![PyPI version](https://badge.fury.io/py/usum.svg)](https://pypi.python.org/pypi/usum/)

## Installation

Install `USEARCH` manually: https://drive5.com/usearch/download.html
<br>(consider supporting the author by buying the 64-bit license)

Install `usum` using pip:

```bash
pip install usum
```

## Usage

Use `usum` to plot input protein or DNA sequences in FASTA format.

**Note:** `USEARCH` is not built for long sequences.

### Minimal example


```bash
usum example.fa --maxdist 0.2 --termdist 0.3 --output example
```

### Multiple input files with labels

```bash
usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output umap
```

This will produce a PNG plot:

![UMAP static example](docs/example1.png?raw=true "UMAP static example")

An interactive [Bokeh](https://bokeh.org) HTML plot is also created:

![UMAP Bokeh example](docs/example2.png?raw=true "UMAP Bokeh example")

### Plotting a random subset

You can use `--limit` to extract and plot a random subset of the input sequences.

```bash
# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output umap
```

Use the `--seed` option to make the random subset reproducible.
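
To illustrate why a fixed seed makes a random subset reproducible, here is a minimal sketch using only the Python standard library. It is not USUM's actual sampling code: the function name `sample_subset` and its parameters are hypothetical and only mirror the idea behind `--limit` and `--seed`.

```python
import random


def sample_subset(records, limit, seed=None):
    """Return at most `limit` records, chosen at random.

    Hypothetical helper for illustration only, not part of the usum API.
    """
    # A dedicated generator keeps the result independent of global random state.
    rng = random.Random(seed)
    if limit is None or limit >= len(records):
        # Nothing to subsample: keep every record in its original order.
        return list(records)
    # Sample without replacement, so no record is picked twice.
    return rng.sample(records, limit)


records = [f"seq{i}" for i in range(100)]
first = sample_subset(records, 10, seed=1)
second = sample_subset(records, 10, seed=1)
print(first == second)  # same seed, same subset
```

With the same seed and the same input, the sketch returns the same subset on every run; with `seed=None` the selection differs between runs.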

### Reusing previous results

When changing just the plot options, you can use `--resume` to reuse previous results from the output folder.

**Warning:** This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.

```bash
# Reuse result from umap output directory
usum --resume --output umap --width 600 --height 600 --theme fire
```

### Programmatic use

```python
from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)
```

## How it works

- A sparse distance matrix is calculated using the USEARCH [calc_distmx](https://drive5.com/usearch/manual/cmd_calc_distmx.html) command.
- The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
- The distance matrix is embedded as a `precomputed` metric using [UMAP](https://github.com/lmcinnes/umap).
- The embedding is plotted using [umap.plot](https://umap-learn.readthedocs.io/en/latest/plotting.html).
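
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not USUM's implementation: it assumes the sparse distances are stored as tab-separated `label1`, `label2`, `distance` rows, and it assumes that pairs missing from the file can be treated as maximally distant (`1.0`). The function name `read_sparse_distances` is hypothetical.

```python
import io


def read_sparse_distances(handle, missing=1.0):
    """Read 'label1<TAB>label2<TAB>distance' rows into a dense symmetric matrix.

    Hypothetical helper; the file layout and the `missing` fill value
    are assumptions made for this example.
    """
    labels = []  # sequence labels in order of first appearance
    index = {}   # label -> row/column position
    pairs = {}   # (i, j) -> distance
    for line in handle:
        line = line.strip()
        if not line:
            continue
        a, b, dist = line.split("\t")
        for label in (a, b):
            if label not in index:
                index[label] = len(labels)
                labels.append(label)
        pairs[(index[a], index[b])] = float(dist)

    n = len(labels)
    # Start with zeros on the diagonal and the fill value everywhere else.
    matrix = [[0.0 if i == j else missing for j in range(n)] for i in range(n)]
    for (i, j), dist in pairs.items():
        # Distances are symmetric, so fill both triangles.
        matrix[i][j] = dist
        matrix[j][i] = dist
    return labels, matrix


example = io.StringIO("seq1\tseq2\t0.1\nseq2\tseq3\t0.15\n")
labels, matrix = read_sparse_distances(example)
print(labels)
print(matrix)
```

A matrix built this way could then be passed to `umap.UMAP(metric='precomputed').fit_transform(...)` from the `umap-learn` package to obtain 2D coordinates for plotting.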


