.. CPAT documentation master file, created by
   sphinx-quickstart on Thu Aug  9 10:38:29 2012.

Introduction
================================


Using RNA-seq, tens of thousands of novel transcripts and isoforms have been identified
(Djebali et al., Nature, 2012; Cabili et al., Genes & Development, 2011).
The discovery of this hidden transcriptome has renewed the need to distinguish coding
from noncoding RNA. However, most previous coding potential prediction methods rely heavily
on alignment, either pairwise alignment to search for protein evidence or multiple alignment
to calculate a phylogenetic conservation score (such as `CPC <http://cpc.cbi.pku.edu.cn/>`_, `PhyloCSF <http://compbio.mit.edu/PhyloCSF>`_ and `RNACode <http://wash.github.com/rnacode/>`_). This is because most previously identified transcripts,
including **protein coding RNA** and **short, housekeeping/regulatory RNAs** such as snRNAs,
snoRNAs and tRNAs, are highly conserved. While still very useful, these approaches have several
limitations:

1) Most lncRNAs are poorly conserved and tend to be lineage specific, which greatly limits the discriminative power of alignment-based methods. For example, of 550 lncRNAs detected in zebrafish, only 29 had detectable sequence similarity with putative mammalian orthologs (Ulitsky et al., Cell, 2011).

2) A significant fraction of protein-coding genes may have a noncoding isoform that is alternatively processed or transcribed from an alternative promoter. Such ncRNAs cannot be classified correctly through homology search because they have significant matches to protein-coding genes.

3) Alignment-based methods are extremely slow. For example, `CPC <http://www.ncbi.nlm.nih.gov/pubmed/17631615>`__ takes 6050 CPU minutes (> 4 days) to evaluate 14,000 lncRNA transcripts.

4) Reliability depends on alignment quality. Most multiple-alignment tools use heuristic search and are not guaranteed to produce optimal alignments.


CPAT overcomes the above issues by using a logistic regression model built on four purely
sequence-based, linguistic features:

1) ORF size
2) ORF coverage
3) `Fickett TESTCODE <http://nar.oxfordjournals.org/content/10/17/5303.abstract>`_
4) `Hexamer usage bias <http://nar.oxfordjournals.org/content/20/24/6441.abstract>`_
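
The first two features in the list above can be illustrated with a short sketch. It scans the three forward reading frames for the longest ORF and reports ORF size (in nucleotides) and ORF coverage (ORF size divided by transcript length). This is not CPAT's own code: the function names, the restriction to the forward strand, the requirement of an in-frame stop codon, and counting the stop codon in the ORF size are all assumptions made for illustration.

.. code-block:: python

    def find_longest_orf(seq, start_codons=("ATG",),
                         stop_codons=("TAG", "TAA", "TGA")):
        """Return (start, end) of the longest forward-strand ORF, or None.

        Assumption: an ORF runs from a start codon to the first in-frame
        stop codon, and `end` is the position just past that stop codon.
        """
        seq = seq.upper().replace("U", "T")  # accept RNA input as well
        best = None
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if orf_start is None:
                    if codon in start_codons:
                        orf_start = i  # open a candidate ORF
                elif codon in stop_codons:
                    end = i + 3
                    if best is None or end - orf_start > best[1] - best[0]:
                        best = (orf_start, end)
                    orf_start = None  # look for the next ORF in this frame
        return best


    def orf_features(seq, **codons):
        """Return (orf_size, orf_coverage) for one transcript sequence."""
        orf = find_longest_orf(seq, **codons)
        if orf is None or not seq:
            return 0, 0.0
        size = orf[1] - orf[0]
        return size, size / len(seq)


    print(orf_features("CCATGAAATTTTAGGG"))  # (12, 0.75)

The start and stop codon sets are keyword arguments so that the sketch mirrors the idea of user-defined ORF boundaries; the defaults match the ones documented for ``-s`` and ``-t``.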

A method based on linguistic features does not require other genomes or protein databases
for alignment and is more robust. Because it is alignment free, it runs much faster and is
also easier to use. For example, CPAT took only several minutes to evaluate the above 14,000
lncRNAs. More importantly, compared with alignment-based approaches, CPAT achieves better
sensitivity and specificity (0.966 when tested on human gene annotation).
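
For readers unfamiliar with logistic regression, the generic form of such a model over the four features is sketched here. The symbols are notation introduced for this illustration only: :math:`x_1, \dots, x_4` stand for ORF size, ORF coverage, Fickett score and hexamer score, and :math:`\beta_0, \dots, \beta_4` are coefficients fitted on training data. Their values are not given in this document.

.. math::

   P(\text{coding} \mid x) = \frac{1}{1 + e^{-z}}, \qquad
   z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4

A transcript is called coding when this probability exceeds a chosen cutoff; the cutoff itself is not specified here.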

Command line usage
===================

::

    -h, --help      Show this help message and exit.

    -p, --species   {Human,Mouse,Fly,Zebrafish}
                    Species of the lncRNAs. Choose from Human, Mouse,
                    Fly or Zebrafish (note that the first character
                    is upper case).

    -s, --start     Start codon used to define the open reading frame
                    (ORF). Give it as DNA sequence, so use 'T' instead
                    of 'U'. Default is ATG.

    -t, --stop      Stop codon(s) used to define the open reading frame
                    (ORF). Give them as DNA sequence, so use 'T' instead
                    of 'U'. Separate multiple stop codons with ','.
                    Default is TAG,TAA,TGA.
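
The codon options take plain DNA triplets, comma-separated when several stop codons are given. The following sketch checks such a value before it is passed on the command line. ``parse_codons`` is a hypothetical helper written for this illustration; it is not part of ezLncPred or CPAT, and the tools may validate their input differently.

.. code-block:: python

    def parse_codons(value):
        """Split a comma-separated codon string and check each codon.

        Hypothetical helper: accepts e.g. "TAG,TAA,TGA" and returns
        ["TAG", "TAA", "TGA"]; raises ValueError for anything that is
        not a three-letter DNA word (so 'U' is rejected).
        """
        codons = [c.strip().upper() for c in value.split(",") if c.strip()]
        if not codons:
            raise ValueError("no codon given")
        for codon in codons:
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"not a DNA codon: {codon!r}")
        return codons


    print(parse_codons("TAG,TAA,TGA"))  # ['TAG', 'TAA', 'TGA']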

**Example**::

    ezLncPred CPAT -h
    ezLncPred -i example.fa -o results CPAT
    ezLncPred -i example.fa -o results CPAT -p Human
    ezLncPred -i example.fa -o results CPAT -s ATG
    ezLncPred -i example.fa -o results CPAT -t TAG,TAA,TGA

LICENSE
==========
CPAT is distributed under the `GNU General Public License <http://www.gnu.org/copyleft/gpl.html>`_.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301 USA

Reference
==========
Wang, L., Park, H. J., Dasari, S., Wang, S., Kocher, J.-P., & Li, W. (2013). CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Research, 41(6), e74. `doi:10.1093/nar/gkt006 <http://nar.oxfordjournals.org/content/41/6/e74>`_

Contact                        
====================
* Liguo Wang: wang.liguo AT mayo.edu
* Hyun Jung Park: hjpark AT bcm.edu
