Metadata-Version: 2.1
Name: IBSpy
Version: 0.3.1
Summary: A package to detect IBS regions
Home-page: https://github.com/Uauy-Lab/IBSpy
Author: Ricardo H. Ramirez-Gonzalez
Author-email: ricardo.ramirez-gonzalez@jic.ac.uk
License: UNKNOWN
Description: # IBSpy
        
        ![Python package](https://github.com/Uauy-Lab/IBSpy/workflows/Python%20package/badge.svg)
        [![Maintainability](https://api.codeclimate.com/v1/badges/5a4b1b0e89f7f9f8c34c/maintainability)](https://codeclimate.com/github/Uauy-Lab/IBSpy/maintainability)
        
        Python library to identify Identical By State regions
        
        
        
        To build the kmer database with KMC for the tests, run this command:
        
        ```sh
        kmc -k31 -r -ci1 -fm data/test4B.jagger.fa data/test4B.jagger.kmc_k31 tmp
        ```
        
        
        ## Installing IBSpy
        
        The easiest way to install IBSpy is to use pip3.
        
        ```sh
        pip3 install IBSpy
        ```
        
        
        If ```pip3``` fails, you can clone the project and compile it with:
        
        ```sh
        pip3 install cython biopython pyfaidx
        python3 setup.py develop
        ```
        
        Then you should have the ```IBSpy``` command available.
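        
        A quick way to confirm that the commands are on your ```PATH``` is sketched below. The helper name ```check_tools``` is hypothetical and not part of IBSpy; it only relies on the standard ```command -v``` shell builtin.
        
        ```sh
        # Hypothetical helper: report which of the given commands are missing from PATH.
        # Prints one line per missing command and returns non-zero if any is missing.
        check_tools() {
            missing=0
            for tool in "$@"; do
                if ! command -v "$tool" >/dev/null 2>&1; then
                    echo "missing: $tool"
                    missing=1
                fi
            done
            return "$missing"
        }
        
        # IBSpy and IBSplot are the commands used in this README;
        # kmc is only needed if you work with KMC3 databases.
        check_tools IBSpy IBSplot kmc || echo "Some tools are not on PATH; review the installation steps."
        ```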
        
        
        ### KMC3 
        
        If you want to use the [KMC](https://github.com/refresh-bio/KMC) bindings, install KMC and compile its Python API (```py_kmc_api```) following the KMC instructions.
        
        Then run the following commands to set up the path for it:
        ```sh
        cd KMC/py_kmc_api
        source set_path.sh 
        ```
        
        
        ## Preparing the databases
        
        IBSpy requires a kmer database built from the sequencing files. Currently two formats are supported:
        
          1. Jellyfish: Follow the instructions on its [website](https://github.com/gmarcais/Jellyfish/blob/master/doc/Readme.md).
          2. kmerGWAS: Uses an ad hoc file format that contains only the kmers in a sorted binary representation. This option is faster than the Jellyfish version, but creating the kmer table is less straightforward. The manual is [here](https://github.com/voichek/kmersGWAS/blob/master/manual.pdf).
        
        ## Running unit tests
        
        To make sure that your changes haven't broken the core of IBSpy, run the unit tests:
        
        ```sh
        python3 setup.py test
        ```
        
        
        ## Running IBSpy
        
        IBSpy has relatively few options; you can list them with the ```--help``` option.
        
        ```sh
        IBSpy --help
        usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
                     [-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]
        
        optional arguments:
          -h, --help            show this help message and exit
          -w WINDOW_SIZE, --window_size WINDOW_SIZE
                                window size to analyze
          -k KMER_SIZE, --kmer_size KMER_SIZE
                                Kmer size of the database
          -d DATABASE, --database DATABASE
                                Kmer database
          -r REFERENCE, --reference REFERENCE
                                The reference with the position of the kmers
          -z, --compress        When an ouput file is present, it is compressed as .gz
          -o OUTPUT, --output OUTPUT
                                Output file. If missing, the ouptut is sent to stdout
          -f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
                                Database format 
        ```
        
        To generate the table with the number of observed kmers and variants using a kmer database from kmerGWAS, run the following command:
        
        
        ```sh
        IBSpy --output "kmer_windows_LineXXX.tsv.gz" --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --compress --database_format kmerGWAS
        ```
        For KMC3, the database is the name used while creating the database, not the filename. 
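        
        When several samples share the same reference, the command above can be generated in a loop. The sketch below is a dry run that only prints the commands. The function name ```ibspy_batch``` and the directory layout (one directory per sample, each containing a ```kmers_with_strand``` kmerGWAS table) are assumptions made for illustration, and the sample names are placeholders.
        
        ```sh
        # Hypothetical helper: print one IBSpy command per sample (dry run).
        # Arguments: reference FASTA, window size, then one or more sample names.
        # Assumes each sample has its kmerGWAS table at <sample>/kmers_with_strand.
        ibspy_batch() {
            reference=$1
            window=$2
            shift 2
            for sample in "$@"; do
                # echo prints the command without running it; names containing
                # spaces would need extra quoting before the output is executed.
                echo IBSpy --output "kmer_windows_${sample}.tsv.gz" --compress \
                    --database "${sample}/kmers_with_strand" \
                    --reference "$reference" \
                    --window_size "$window" --database_format kmerGWAS
            done
        }
        
        # Placeholder sample names; replace them with your own.
        ibspy_batch arinaLrFor.fa 50000 LineA LineB
        ```
        
        Once the printed commands look right, they can be run by piping the output to ```sh```.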
        
        
        ## Running IBSplot
        
        Look at the IBSplot commands using ```--help```.
        
        ```sh
        IBSplot --help
        usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
                       [-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
                       [-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]
        
        optional arguments:
          -h, --help            show this help message and exit
          -i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
                                tvs file genetared by IBSpy output
          -w WINDOW_SIZE, --window_size WINDOW_SIZE
                                Windows size to count variations within
          -f FILTER_COUNTS, --filter_counts FILTER_COUNTS
                                Filter number of variaitons above this threshold to
                                compute GMM model, default=None
          -n N_COMPONENTS, --n_components N_COMPONENTS
                                Number of componenets for the GMM model, default=3
          -c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
                                type of covariance used for GMM model, default="full"
          -s STITCH_NUMBER, --stitch_number STITCH_NUMBER
                                Consecutive "outliers" in windows to stitch, default=3
          -o OUTPUT, --output OUTPUT
                                tsv file with variations count by windows and summary
                                statistics
          -r REFERENCE, --reference REFERENCE
                                genome reference name
          -q QUERY, --query QUERY
                                query sample
          -p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
                                histograms and ascatter files in .PDF format
        ```
        
        IBSplot uses the output table generated by IBSpy as described above (e.g., ```"kmer_windows_LineXXX.tsv.gz"```). It can aggregate the variant counts into larger windows; the example below uses 400,000 bp windows to fit a GMM model and generate the plots.
        
        To generate the table with variant counts categorized by the GMM model as IBS or non-IBS, and to generate the plots, run the command below. The GMM model is described in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture).
        
        ```sh
        # minimal arguments
        IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
        ```
        
        In addition, you can include some or all of the following options to tune the GMM model parameters and best define the IBS and non-IBS regions for the reference and query samples used:
        
        ```sh
        IBSplot --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3
        ```
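        
        The tuning options are passed together with the minimal arguments. As a sketch, the helper below prints a complete command that combines both. The function name ```ibsplot_cmd``` and the environment variable names are hypothetical; the flags and example values are the ones shown in this README.
        
        ```sh
        # Hypothetical helper: print a full IBSplot command (dry run) with the
        # required arguments plus the GMM tuning options.
        # Arguments: IBSpy counts table, reference name, query name.
        # Tuning values come from environment variables (names are assumptions)
        # and fall back to the values used in the example in this README.
        ibsplot_cmd() {
            counts=$1
            reference=$2
            query=$3
            echo IBSplot --IBSpy_counts "$counts" --window_size 400000 \
                --output gmm_ibs.tsv.gz --reference "$reference" --query "$query" \
                --plot_output gmm_plots.pdf \
                --filter_counts "${FILTER_COUNTS:-1000}" \
                --n_components "${N_COMPONENTS:-3}" \
                --covariance_type "${COVARIANCE_TYPE:-full}" \
                --stitch_number "${STITCH_NUMBER:-3}"
        }
        
        ibsplot_cmd kmeribs-Wheat_Jagger-Flame.tsv.gz Jagger Flame
        ```
        
        Setting one of the variables for a single call, for example ```N_COMPONENTS=2 ibsplot_cmd ...```, changes only that option in the printed command.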
        
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
