Metadata-Version: 2.1
Name: helen
Version: 0.0.6
Summary: RNN based assembly HELEN. It works paired with MarginPolish.
Home-page: https://github.com/kishwarshafin/helen
Author: Kishwar Shafin
Author-email: kishwar.shafin@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Requires-Python: >=3.5.*
Description-Content-Type: text/markdown
Requires-Dist: h5py
Requires-Dist: tqdm
Requires-Dist: numpy
Requires-Dist: wget
Requires-Dist: torch
Requires-Dist: torchvision
Requires-Dist: torchnet
Requires-Dist: pyyaml
Requires-Dist: onnx
Requires-Dist: onnxruntime
Requires-Dist: hyperopt
Requires-Dist: matplotlib

# H.E.L.E.N.
H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)


[![Build Status](https://travis-ci.com/kishwarshafin/helen.svg?branch=master)](https://travis-ci.com/kishwarshafin/helen)
___________________________________________________________
Pre-print of a paper describing the methods and overview of a suggested `de novo assembly` pipeline is now available:
#### [Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit](https://www.biorxiv.org/content/10.1101/715722v1)
__________________________________________________________

## Overview
`HELEN` is a polisher intended for polishing human genome assemblies. `HELEN` operates on the pileup summary generated by [MarginPolish](https://github.com/UCSC-nanopore-cgl/marginPolish). `MarginPolish` uses a probabilistic graphical model to encode read alignments through a draft assembly and find the maximum-likelihood consensus sequence. The graphical model operates in run-length space, which helps reduce errors in homopolymeric regions. `MarginPolish` can produce tensor-like summaries encapsulating its internal likelihood weights. The weights are assigned to each genomic position over multiple likely outcomes, making them suitable for inference by a deep neural network model.

`HELEN` uses a Recurrent-Neural-Network (RNN) based Multi-Task Learning (MTL) model that can predict a base and a run-length for each genomic position using the weights generated by `MarginPolish`.

© 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten. <br/>
Computational Genomics Lab (CGL), University of California, Santa Cruz.

## Why MarginPolish-HELEN?
* `MarginPolish-HELEN` outperforms other graph-based and Neural-Network based polishing pipelines.
* Easily usable via Docker for both `GPU` and `CPU`.
* Highly optimized pipeline that is faster than any other available polishing tool (~4 hours for `HELEN`).
* We have <b>sequenced, assembled, and polished 11 samples</b> to ensure robustness, runtime consistency, and cost efficiency.
* We tested GPU usage on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.
* Open source [(MIT License)](LICENSE).

## Walkthrough
A `demo` walkthrough is available here: [demo](docs/walkthrough.md)

## Table of contents
* [Workflow](#workflow)
* [Installation](#Installation)
* [Usage](#Usage)
* [Models](#Models)
   * [Released Models](#Released-Models)
* [Runtime and Cost](#Runtime-and-Cost)
* [Results](#Results)
* [Eleven high-quality assemblies](#Eleven-high-quality-assemblies)
* [Help](#Help)
* [Acknowledgement](#Acknowledgement)

## Workflow

The workflow is as follows:
* Generate an assembly with [Shasta](https://github.com/chanzuckerberg/shasta).
* Create a mapping between reads and the assembly using [Minimap2](https://github.com/lh3/minimap2).
* Use [MarginPolish](https://github.com/UCSC-nanopore-cgl/marginPolish) to generate the images.
* Use HELEN to generate a polished consensus sequence.
<p align="center">
<img src="img/pipeline.svg" alt="pipeline.svg" height="640p">
</p>

## Installation
We have docker support for both `MarginPolish` and `HELEN`. Users can install `MarginPolish` and `HELEN` on <b>`Ubuntu 18.04`</b> or any other Linux-based system by following the instructions from our [Installation Guide](docs/installation.md).

If you have locally installed `MarginPolish-HELEN`, please follow the [Local Install Usage Guide](docs/usage_local_install.md).

## Usage
`MarginPolish` requires a draft assembly and a mapping of reads to the draft assembly. We recommend using `Shasta` as the initial assembler and `Minimap2` for the mapping.

#### Step 1: Generate an initial assembly
Although any assembler can be used to generate the initial assembly, we highly recommend using [Shasta](https://github.com/chanzuckerberg/shasta).

Please see the [quick start documentation](https://chanzuckerberg.github.io/shasta/QuickStart.html) to learn how to use Shasta. Note that Shasta is memory-intensive.
> For a human-sized assembly, the AWS instance type x1.32xlarge is recommended. It is usually available for around $4/hour on the AWS spot market and should complete a human-sized assembly in a few hours at around 60x coverage.

An assembly can be generated by running:
```bash
# you may need to convert the fastq to a fasta file
./shasta-Linux-0.1.0 --input <reads.fa> --output <path_to_shasta_output>
```
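The FASTQ-to-FASTA conversion mentioned in the comment above is a simple transformation: each FASTQ record is four lines (`@header`, sequence, `+`, qualities), and only the first two are kept. A minimal Python sketch, using a toy record for illustration (tools like `seqtk` can do the same thing):

```python
def fastq_to_fasta(fastq_text):
    """Keep the header (rewritten with '>') and sequence line of each record."""
    lines = fastq_text.strip().split("\n")
    out = []
    for i in range(0, len(lines), 4):   # FASTQ records are 4 lines each
        out.append(">" + lines[i][1:])  # '@read1' -> '>read1'
        out.append(lines[i + 1])        # the sequence line
    return "\n".join(out) + "\n"

# A toy two-record FASTQ, just to demonstrate the conversion.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nJJJJ\n"
print(fastq_to_fasta(toy_fastq), end="")
# prints:
# >read1
# ACGT
# >read2
# TTGA
```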

#### Step 2: Create an alignment between reads and shasta assembly
We recommend using `MiniMap2` to generate the mapping between the reads and the assembly.
```bash
# we recommend using FASTQ as marginPolish uses quality values
# This command runs minimap2 with 32 threads; change the thread count as needed.
minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools sort -@ 32 | samtools view -hb -F 0x104 > reads_2_assembly.bam
samtools index -@32 reads_2_assembly.bam

# the -F 0x104 flag removes unmapped and secondary alignments
```
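The `0x104` value passed to `samtools view -F` is the bitwise OR of two standard SAM flag bits, which you can verify directly:

```python
# SAM flag bits excluded by `samtools view -F 0x104`:
UNMAPPED = 0x4      # bit 4:   read unmapped
SECONDARY = 0x100   # bit 256: secondary alignment

print(hex(UNMAPPED | SECONDARY))  # prints 0x104
```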
#### Step 3: Generate images using MarginPolish
##### Run MarginPolish using docker
`MarginPolish` can be used in a docker container. You can get the image from:
```bash
docker pull kishwars/margin_polish:latest
docker run kishwars/margin_polish:latest --help
```

To generate images with the `MarginPolish` docker image, first collect all your input data (`shasta_assembly.fa`, `reads_2_assembly.bam`, `allParams.np.human.guppy-ff-235.json`) into one directory, e.g. `</your/data/dir>`.
Then please run:
```bash
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/data/dir>:/data kishwars/margin_polish:latest reads_2_assembly.bam \
shasta_assembly.fa \
/opt/MarginPolish/params/<model_name.json> \
-t <number_of_threads> \
-o output/marginpolish_images \
-f
```

You can get the `params.json` from `path/to/marginpolish/params/allParams.np.human.guppy-ff-235.json`.

#### Step 4: Run HELEN

##### Download Model
Before running `call_consensus.py`, please download the model appropriate for your data. Please read our [model guideline](#Models) to understand which model to pick.

##### Get docker images (GPU)
Please install `CUDA 10.0` to run the GPU-supported docker image for `HELEN`.
```bash
sudo apt-get install nvidia-docker2
sudo docker pull kishwars/helen:0.0.1.gpu
sudo nvidia-docker run kishwars/helen:0.0.1.gpu call_consensus.py -h
```

###### Run call_consensus.py
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t 1 \
-g

Arguments:
  -h, --help            show this help message and exit
  -i IMAGE_FILE, --image_file IMAGE_FILE
                        [REQUIRED] Path to a directory where all MarginPolish
                        generated images are.
  -m MODEL_PATH, --model_path MODEL_PATH
                        [REQUIRED] Path to a trained model (pkl file). Please
                        see our github page to see options.
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size for testing, default is 512. Please set to
                        512 or 1024 for a balanced execution time.
  -w NUM_WORKERS, --num_workers NUM_WORKERS
                        Number of workers to assign to the dataloader. Should
                        be 0 if using Docker.
  -t THREADS, --threads THREADS
                        Number of PyTorch threads to use, default is 1. This
                        may be helpful during CPU-only inference.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the output directory.
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output file. Default is:
                        HELEN_prediction
  -g, --gpu_mode        If set then PyTorch will use GPUs for inference.
```
###### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
sudo nvidia-docker run -v <path/to/input>:/data kishwars/helen:0.0.1.gpu \
stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir/> \
-p <output_prefix>

Arguments:
  -i INPUT_HDF, --input_hdf INPUT_HDF
                        [REQUIRED] Path to a HDF5 file that was generated
                        using call consensus.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        [REQUIRED] Path to the output directory.
  -t THREADS, --threads THREADS
                        [REQUIRED] Number of threads.
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output file. Default is: HELEN_consensus

```
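`call_consensus.py` writes its predictions to an HDF5 file that `stitch.py` consumes. If you want to sanity-check an HDF5 output, `h5py` (already a `HELEN` dependency) can walk its contents. This sketch uses a toy file it creates itself; the group/dataset names are illustrative, not `HELEN`'s actual layout:

```python
import h5py
import numpy as np

# Build a toy HDF5 file standing in for a call_consensus output.
# The "predictions/contig_1/bases" layout below is illustrative only.
with h5py.File("toy_predictions.hdf", "w") as f:
    f.create_dataset("predictions/contig_1/bases",
                     data=np.zeros((4, 10), dtype="u1"))

# Walk the file and print every group/dataset path it contains.
with h5py.File("toy_predictions.hdf", "r") as f:
    f.visit(print)
```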


##### Get docker images (CPU) (not recommended)
If you want to run inference on CPU, pull the CPU image:
```bash
sudo docker pull kishwars/helen:0.0.1.cpu
sudo docker run kishwars/helen:0.0.1.cpu call_consensus.py -h
```

##### Run call_consensus.py (CPU)
Please gather all your data into an input directory, then run `call_consensus.py` using the following command:
```bash
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu call_consensus.py \
-i <marginpolish_images> \
-b <batch_size> \
-m <r941_flip235_v001.pkl> \
-o <output_dir/> \
-p <output_filename_prefix> \
-w 0 \
-t <number_of_threads>
```

##### Run stitch.py
Finally you can run `stitch.py` to get a consensus sequence:
```bash
docker run -it --rm --user=`id -u`:`id -g` --cpus=<number_of_threads> -v </your/current/directory>:/data kishwars/helen:0.0.1.cpu stitch.py \
-i <output_dir/helen_predictions_XX.hdf> \
-t <number_of_threads> \
-o <output_dir> \
-p <output_prefix>
```

## Models
#### Released models
Changes in the basecaller algorithm can directly affect the outcome of `HELEN`. We will release trained models for new basecallers as they come out.
<center>

<table>
  <tr>
    <th>Model Name</th>
    <th>Release Date</th>
    <th>Intended base-caller</th>
    <th>Link</th>
    <th>Comment</th>
  </tr>
  <tr>
    <td>r941_flip231_v001.pkl</td>
    <td>29/05/2019</td>
    <td>Guppy 2.3.1</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip231_v001.pkl">Model_link</a></td>
    <td>The model is trained on chr1-6 of CHM13 <br>with Guppy 2.3.1 base called data.</td>
  </tr>
  <tr>
    <td>r941_flip233_v001.pkl</td>
    <td>29/05/2019</td>
    <td>Guppy 2.3.3</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip233_v001.pkl">Model_link</a></td>
    <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 2.3.3 base called data.</td>
  </tr>
  <tr>
    <td>r941_flip235_v001.pkl</td>
    <td>29/05/2019</td>
    <td>Guppy 2.3.5</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/v0.0.1/r941_flip235_v001.pkl">Model_link</a></td>
    <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 2.3.5 base called data.</td>
  </tr>
  <tr>
      <td>r941_flip305_v001.pkl</td>
      <td>06/11/2019</td>
      <td>Guppy 3.0.5</td>
      <td><a href="https://storage.googleapis.com/kishwar-helen/helen_trained_models/guppy305_trained_models/r941_flip305_helen.pkl">Model_link</a></td>
      <td>The model is trained on autosomes of HG002 except <br>chr 20 with Guppy 3.0.5 base called data.</td>
    </tr>
</table>
</center>

We have seen significant differences in homopolymer base-calls between basecallers. It is important to pick the model matching your basecaller version for the best polishing results.

Confusion matrix of Guppy 2.3.1 on CHM13 chromosome X:
<img src="img/Figure4b.png" alt="guppy235" width="1080p"> <br/>

#### Model Schema

`HELEN` implements a Recurrent-Neural-Network (RNN) based multi-task learning model with hard parameter sharing. It uses a sliding-window approach, moving through the input sequence in chunks. As each window is evaluated independently, `HELEN` can use mini-batches during both training and testing.
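A hard-parameter-sharing multi-task model of this shape can be sketched in PyTorch: one shared RNN encoder feeds two task heads, one predicting the base and one the run-length at each position. The layer sizes, feature count, and class counts below are illustrative, not `HELEN`'s released architecture:

```python
import torch
import torch.nn as nn

class MTLPolisherSketch(nn.Module):
    """Illustrative RNN multi-task model with hard parameter sharing:
    a shared bidirectional GRU encoder plus two per-position heads."""
    def __init__(self, in_features=10, hidden=64, n_bases=5, max_runlength=50):
        super().__init__()
        self.encoder = nn.GRU(in_features, hidden,
                              batch_first=True, bidirectional=True)
        self.base_head = nn.Linear(2 * hidden, n_bases)            # base classes
        self.rle_head = nn.Linear(2 * hidden, max_runlength + 1)   # run-lengths 0..50

    def forward(self, x):
        h, _ = self.encoder(x)  # (batch, window_len, 2 * hidden)
        return self.base_head(h), self.rle_head(h)

model = MTLPolisherSketch()
x = torch.randn(8, 100, 10)  # a mini-batch of 8 windows, 100 positions each
base_logits, rle_logits = model(x)
print(base_logits.shape, rle_logits.shape)
# prints: torch.Size([8, 100, 5]) torch.Size([8, 100, 51])
```

Because each window is scored independently, windows from anywhere in the assembly can be batched together, which is what makes mini-batch training and inference straightforward.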

<p align="center">
<img src="img/model_schema.svg" alt="pipeline.svg" height="640p">
</p>

## Runtime and Cost
`MarginPolish-HELEN` ensures runtime consistency and cost efficiency. We have tested our pipeline on `Amazon Web Services (AWS)` and `Google Cloud Platform (GCP)` to ensure scalability.

We studied several samples of 50-60x coverage and created a suggestion framework for running the polishing pipeline. Please be advised that these are cost-optimized suggestions. For better run-time performance you can use more resources.
#### Google Cloud Platform (GCP)
For `MarginPolish` please use an n1-standard-64 (64 vCPUs, 240GB RAM) instance. <br/>
Our estimated run-time is: 12 hours <br/>
Estimated cost for `MarginPolish`: <b>$33</b>

For `HELEN`, our suggested instance type is:
* Instance type: n1-standard-32 (32 vCPUs, 120GB RAM)
* GPUs: 2 x NVIDIA Tesla P100
* Disk: 2TB SSD
* Cost: $4.65/hour

The estimated runtime with this instance type is 4 hours. <br>
The estimated cost for `HELEN` is <b>$28</b>.

Total estimated run-time for polishing: 16 hours. <br/>
Total estimated cost for polishing: <b>$61</b>

#### Amazon Web Services (AWS)
For `MarginPolish` we recommend a c5.18xlarge (72 vCPUs, 144GiB RAM) instance. <br/>
Our estimated run-time is: 12 hours <br/>
Estimated cost for `MarginPolish`: <b>$39</b>

We recommend using `p2.8xlarge` instance type for `HELEN`. The configuration is as follows:
* Instance type: p2.8xlarge (32 vCPUs, 488GB RAM)
* GPUs: 8 x NVIDIA Tesla K80
* Disk: 2TB SSD
* Cost: $7.20/hour
* Suggested AMI: Deep Learning AMI (Ubuntu) Version 23.0

The estimated runtime with this instance type: 4 hours <br>
The estimated cost for `HELEN` is: <b>$36</b>

Total estimated run-time for polishing: 16 hours. <br/>
Total estimated cost for polishing: <b>$75</b>

Please see our detailed [run-time case study](docs/runtime_cost.md) documentation for better insight.

We also see a significant improvement in runtime over other available polishing algorithms:
<p align="center">
<img src="img/Figure4d.png" alt="pipeline.svg" height="420p">
</p>

## Results
We compared `Medaka` and `HELEN` as polishing pipelines on Shasta assemblies using the `assess_assembly` module available from `Pomoxis`. A summary of the quality we produce:

<p align="center">
<img src="img/Figure4a.png" alt="error_rate" height=420p>
</p>
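Assembly accuracy figures like these are often reported as Phred-scaled quality values (QV), which `assess_assembly` derives from the error rate; the conversion itself is just a logarithm:

```python
import math

def phred_qv(error_rate):
    """Convert an error rate to a Phred-scaled quality value: QV = -10 * log10(e)."""
    return -10 * math.log10(error_rate)

print(round(phred_qv(0.001), 1))  # an error rate of 0.1% corresponds to QV 30.0
```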

We also see that `MarginPolish-HELEN` performs consistently across multiple assemblers.
<p align="center">
<img src="img/Figure4c.png" alt="Multiple_assembler_error_rate" height=420p>
</p>

## Eleven high-quality assemblies
We have sequenced-assembled-polished 11 human genome assemblies at University of California, Santa Cruz with our pipeline. They can be downloaded from our [google bucket](https://console.cloud.google.com/storage/browser/kishwar-helen/polished_genomes/london_calling_2019/).

For quick links, please copy a link from this table and run `wget` to download the file:
```bash
wget <link>
```
The eleven assemblies with their download links:

<table>
  <tr>
    <th>Sample name</th>
    <th>Download link</th>
  </tr>
  <tr>
    <td>HG00733</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG00733_shasta_marginpolish_helen_consensus.fa">HG00733_download_link</a></td>
  </tr>

  <tr>
    <td>HG01109</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG01109_shasta_marginpolish_helen_consensus.fa">HG01109_download_link</a></td>
  </tr>
  <tr>
    <td>HG01243</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG01243_shasta_marginpolish_helen_consensus.fa">HG01243_download_link</a></td>
  </tr>
  <tr>
    <td>HG02055</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02055_shasta_marginpolish_helen_consensus.fa">HG02055_download_link</a></td>
  </tr>
  <tr>
    <td>HG02080</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02080_shasta_marginpolish_helen_consensus.fa">HG02080_download_link</a></td>
  </tr>
  <tr>
    <td>HG02723</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG02723_shasta_marginpolish_helen_consensus.fa">HG02723_download_link</a></td>
  </tr>
  <tr>
    <td>HG03098</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG03098_shasta_marginpolish_helen_consensus.fa">HG03098_download_link</a></td>
  </tr>
  <tr>
    <td>HG03492</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/HG03492_shasta_marginpolish_helen_consensus.fa">HG03492_download_link</a></td>
  </tr>
  <tr>
    <td>GM24143</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24143_shasta_marginpolish_helen_consensus.fa">GM24143_download_link</a></td>
  </tr>
  <tr>
    <td>GM24149</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24149_shasta_marginpolish_helen_consensus.fa">GM24149_download_link</a></td>
  </tr>
  <tr>
    <td>GM24385/HG002</td>
    <td><a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/GM24385_shasta_marginpolish_helen_consensus.fa">GM24385_download_link</a></td>
  </tr>
</table>
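All eleven links follow the same pattern, so the full list can be generated in one loop. The sample names and base URL below are taken from the table above; `make_urls.py` is a hypothetical filename for this snippet:

```python
# Sample names and base URL taken from the download table above.
BASE = ("https://storage.googleapis.com/kishwar-helen/"
        "polished_genomes/london_calling_2019")
SAMPLES = ["HG00733", "HG01109", "HG01243", "HG02055", "HG02080",
           "HG02723", "HG03098", "HG03492", "GM24143", "GM24149", "GM24385"]

urls = [f"{BASE}/{s}_shasta_marginpolish_helen_consensus.fa" for s in SAMPLES]
for url in urls:
    print(url)
# feed the list to wget, e.g.:  python make_urls.py | xargs -n1 wget
```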


We also polished the `CHM13` genome assembly available from the [Telomere-to-telomere consortium](https://github.com/nanopore-wgs-consortium/CHM13) project. <br/>
The polished `CHM13` assembly is available for download here: <a href="https://storage.googleapis.com/kishwar-helen/polished_genomes/london_calling_2019/CHM13_shasta_marginpolish_helen_consensus.fa">CHM13_download_link</a>

## Help
Please open a github issue if you face any difficulties.

## Acknowledgement
We are thankful to [Sergey Koren](https://github.com/skoren) and [Karen Miga](https://github.com/khmiga) for their help with `CHM13` data and evaluation.

We downloaded our data from [Telomere-to-telomere consortium](https://github.com/nanopore-wgs-consortium/CHM13) to evaluate our pipeline against `CHM13`.

We acknowledge the work of the developers of these packages: <br/>
* [Shasta](https://github.com/chanzuckerberg/shasta/commits?author=paoloczi)
* [pytorch](https://pytorch.org/)
* [ssw library](https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library)
* [hdf5 python (h5py)](https://www.h5py.org/)
* [pybind](https://github.com/pybind/pybind11)
* [hyperband](https://github.com/zygmuntz/hyperband)

## Fun Fact
<img src="https://vignette.wikia.nocookie.net/marveldatabase/images/e/eb/Iron_Man_Armor_Model_45_from_Iron_Man_Vol_5_8_002.jpg/revision/latest?cb=20130420194800" alt="guppy235" width="240p"> <img src="https://vignette.wikia.nocookie.net/marveldatabase/images/c/c0/H.E.L.E.N._%28Earth-616%29_from_Iron_Man_Vol_5_19_002.jpg/revision/latest?cb=20140110025158" alt="guppy235" width="120p"> <br/>

The name "HELEN" is inspired by the A.I. created by Tony Stark in Marvel Comics (Earth-616). HELEN was created to control the city Tony was building, named "Troy", making the A.I. "HELEN of Troy".

READ MORE: [HELEN](https://marvel.fandom.com/wiki/H.E.L.E.N._(Earth-616))



© 2019 Kishwar Shafin, Trevor Pesout, Benedict Paten.


