Metadata-Version: 2.1
Name: gff2bed
Version: 0.3.0
Summary: Convert GFF3-formatted data to BED format
Project-URL: Homepage, https://gitlab.com/salk-tm/gff2bed
Project-URL: Documentation, https://salk-tm.gitlab.io/gff2bed
Author-email: Anthony Aylward <aaylward@salk.edu>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.7
Description-Content-Type: text/markdown

# gff2bed

## Overview

[GFF3](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) and [BED](https://bedtools.readthedocs.io/en/latest/content/general-usage.html) are common formats for storing the coordinates of genomic features such as genes. GFF3 format is more versatile, but BED format is simpler and enjoys a rich ecosystem of utilities such as [bedtools](https://bedtools.readthedocs.io/en/latest/index.html). For this reason, it is often convenient to store genomic features in GFF3 format and convert them to BED format for genome arithmetic.

This module provides two convenience functions to streamline converting data from GFF3 to BED format for bioinformatics analysis: `parse()`, which reads data from a GFF3 file, and `convert()`, which converts GFF3-formatted data to BED-formatted data that can be passed on e.g. to [pybedtools](https://daler.github.io/pybedtools/).

## Installation

Install `gff2bed` with `pip`

```sh
pip install gff2bed
```

## Tutorial

To follow this tutorial, first ensure you have the following modules installed
in addition to `gff2bed``:

- [urllib3](https://urllib3.readthedocs.io/en/stable/)
- [pandas](https://pandas.pydata.org/docs/)
- [pybedtools](https://daler.github.io/pybedtools/)

This tutorial will involve working with some files on disk, so we'll make a
temporary directory for easy cleanup later.

```python
from tempfile import TemporaryDirectory
temp_dir = TemporaryDirectory()
```
Next, download an example GFF3 file

```python
import urllib3
import shutil
import os.path
GFF3_URL = 'https://gitlab.com/salk-tm/gff2bed/-/raw/main/test/data/ColCEN_AT1G01010-20_TAIR10.gff3.gz'
GFF3_FILE = os.path.join(temp_dir.name, 'ColCEN_AT1G01010-20_TAIR10.gff3.gz')
http = urllib3.PoolManager()
with http.request('GET', GFF3_URL, preload_content=False) as r, open(GFF3_FILE, 'wb') as dest_file:
    shutil.copyfileobj(r, dest_file)
```

To read the GFF3 file into a Pandas data frame without converting to BED, use `gff2bed.parse()``

```python    
import pandas as pd
import gff2bed
gff_data = pd.DataFrame(gff2bed.parse(GFF3_FILE))
gff_data.head()
```

```
      0     1      2  3                                                  4
0  Chr1  7489   9757  +  {'ID': 'AT1G01010', 'Note': 'protein_coding_ge...
1  Chr1  9786  12596  -  {'ID': 'AT1G01020', 'Note': 'protein_coding_ge...
```

   > **Note:** The implementation of `gff2bed` follows a philosophy of simplicity. It depends on nothing but the built-in python libraries, and it includes nothing but the `parse()` and `convert()` functions. Typically when applying `gff2bed` in practice, you will use it in conjunction with other modules such as `pandas` or `pybedtools` 

To create a data frame of BED formatted data, pass the stream to `gff2bed.convert()` before passing to `pd.DataFrame()`

```python
bed_data = pd.DataFrame(gff2bed.convert(gff2bed.parse(GFF3_FILE)))
bed_data.head()
```

```
      0     1      2          3  4  5
0  Chr1  7488   9757  AT1G01010  0  +
1  Chr1  9785  12596  AT1G01020  0  -
```

You can similarly create a `BedTool` with `pybedtools`

```python
from pybedtools import BedTool
bed_data = BedTool(gff2bed.convert(gff2bed.parse(GFF3_FILE))).saveas()
bed_data.head()
```

```
Chr1    7488    9757    AT1G01010       0       +
 Chr1   9785    12596   AT1G01020       0       -
```

To complete the tutorial, clean up the temporary directory

```python
temp_dir.cleanup()
```
