Metadata-Version: 2.1
Name: bambu-qsar
Version: 0.0.14
Summary: bambu (bioassays model builder) is CLI tool to build QSAR models based on PubChem BioAssays datasets
Author: Isadora Leitzke Guidotti, Frederico Schmitt Kremer
Author-email: leitzke.gi@gmail.com, fred.s.kremer@gmail.com
License: UNKNOWN
Keywords: bioinformatics machine-learning data science drug discovery QSAR
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: flaml
Requires-Dist: imbalanced-learn
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: lightgbm
Requires-Dist: retrying
Requires-Dist: gensim (==3.8.3)

# Bambu

![](https://img.shields.io/docker/pulls/omixlab/bambu-qsar)

![](https://img.shields.io/pypi/dm/bambu-qsar)

![](media/logo.png)

Bambu (*BioAssays Model Builder*), is a simple tool to generate QSAR models based on [PubChem BioAssays](https://pubchem.ncbi.nlm.nih.gov/) datasets. It relies on [RDKit](https://rdkit.org/) and the [FLAML AutoML framework](https://github.com/microsoft/FLAML) and provides utilitaries for downloading and preprocessing datasets, as well training and running the predictive models.

## Try it!

[Try Bambu on this Google Colab Notebook ^^](https://colab.research.google.com/github/omixlab/bambu-v2/blob/main/notebooks/Bambu%20Google%20Colab%20Tutorial.ipynb)

## Installing

### Installing as a PyPI package using `pip`:

```
$ pip install bambu-qsar
```

**Note:** RDKit must be installed separately.

### Intalling as an environment using `conda` on Linux:

```
$ git clone https://github.com/omixlab/bambu-v2
$ cd bambu-qsar
$ make setup PLATFORM=linux
$ conda activate bambu-qsar
```

### Running it with Docker

```
$ docker run -ti omixlab/bambu-qsar:latest
```

## Downloading a PubChem BioAssays data

Downloads a PubChem BioAssays data and save in a CSV file, containing the InchI representation and the label indicating molecules that were found to be active or inactive against a given target.

```
$ bambu-download \
	--pubchem-assay-id 29 \
	--pubchem-InchI-chunksize 100 \
	--output 29_raw.csv
```

The generated output contains the columns `pubchem_molecule_id` (Substance ID or Compound ID, depending on the option selected during download), `InChI` and `activity`. Only the fields `InchI` and `activity` are used in futher steps. 

|pubchem_molecule_id|pubchem_molecule_type|InChI                                                                                                                                                                                                                                                                                                                                                                                                              |activity|
|-------------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
|596                |compounds            |InChI=1S/C9H13N3O5/c10-5-1-2-12(9(16)11-5)8-7(15)6(14)4(3-13)17-8/h1-2,4,6-8,13-15H,3H2,(H2,10,11,16)                                                                                                                                                                                                                                                                                                              |active  |
|1821               |compounds            |InChI=1S/C9H11FN2O6/c10-3-1-12(9(17)11-7(3)16)8-6(15)5(14)4(2-13)18-8/h1,4-6,8,13-15H,2H2,(H,11,16,17)                                                                                                                                                                                                                                                                                                             |active  |
|2019               |compounds            |InChI=1S/C62H86N12O16/c1-27(2)42-59(84)73-23-17-19-36(73)57(82)69(13)25-38(75)71(15)48(29(5)6)61(86)88-33(11)44(55(80)65-42)67-53(78)35-22-21-31(9)51-46(35)64-47-40(41(63)50(77)32(10)52(47)90-51)54(79)68-45-34(12)89-62(87)49(30(7)8)72(16)39(76)26-70(14)58(83)37-20-18-24-74(37)60(85)43(28(3)4)66-56(45)81/h21-22,27-30,33-34,36-37,42-45,48-49H,17-20,23-26,63H2,1-16H3,(H,65,80)(H,66,81)(H,67,78)(H,68,79)|active  |
|2082               |compounds            |InChI=1S/C12H15N3O2S/c1-3-6-18-8-4-5-9-10(7-8)14-11(13-9)15-12(16)17-2/h4-5,7H,3,6H2,1-2H3,(H2,13,14,15,16)                                                                                                                                                                                                                                                                                                        |active  |
|2569               |compounds            |InChI=1S/C15H19N3O5/c1-8-11(17-3-4-17)14(20)10(9(22-2)7-23-15(16)21)12(13(8)19)18-5-6-18/h9H,3-7H2,1-2H3,(H2,16,21)                                                                                                                                                                                                                                                                                                |active  |
|2674               |compounds            |InChI=1S/C29H26O10/c1-10(30)5-12-18-19-13(6-11(2)31)29(37-4)27(35)21-15(33)8-17-23(25(19)21)22-16(38-9-39-17)7-14(32)20(24(18)22)26(34)28(12)36-3/h7-8,10-11,30-31,34-35H,5-6,9H2,1-4H3                                                                                                                                                                                                                            |active  |
|2693               |compounds            |InChI=1S/C31H30N6O6S4/c1-33-25(42)30(15-38)34(2)23(40)28(33,44-46-30)12-17-13-36(21-11-7-4-8-18(17)21)27-14-29-24(41)35(3)31(16-39,47-45-29)26(43)37(29)22(27)32-20-10-6-5-9-19(20)27/h4-11,13,22,32,38-39H,12,14-16H2,1-3H3                                                                                                                                                                                       |active  |


## Computing descriptors or fingerprints

Computes molecule descriptors or Morgan fingerprints for a given datasets produced by `bambu-download` (or following the same format). The output also contains a `train` and `test` subsets, whose sizes are defined based on the `--train-test-split-percent` argument. The argument `--resample` might be used to perform a random undersampling in the dataset, as most HTS datasets are heavily umbalanced. The path passed to `--output` is used as template to generate the `train` and `test` file. In this case, `29_preprocess_train.csv` and `29_preprocess_test.csv` respectively.

```
$ bambu-preprocess \
	--input 29_raw.csv \
	--output 29_preprocess.csv \
	--output-preprocessor 29_preprocessor.pickle \
	--feature-type morgan-2048 \
	--train-test-split 0.75 \
	--undersample
``` 

## Train

Trains a classification model using the FLAML AutoML framework based on the `bambu-preprocess` output datasets. The user may adjust most of the `flaml.automl.AutoML` parameters using the command line arguments. In this case we are using an Extra Trees Classifier.

```
$ bambu-train \
	--input-train 29_preprocess_train.csv \
	--output 29_model.pickle \
	--time-budget 3600 \
	--estimators extra_tree
``` 

A list of all available estimators can be accessed using the command `bambu-train --list-estimators`. Currently, only `rf` (*Random Forest*) and `extra_tree` are available.

## Validation

An y-randomization validation can be performed using the command `bambu-validate`, which will compute accuracy, recall, precision, f1-score and ROC AUC training with the original training dataset and by validating 
with the test one, and furtherly randomizing the training labels and several times (`--randomizations`). For each randomization, classification metrics
are computed again and significancy value (p-value) is computed based
on the z-score-normalized metrics.

```
$ bambu-validate \
	--input-train 29_preprocess_train.csv \
	--input-test 29_preprocess_test.csv \
	--model 29_model.pickle \
	--output 29_model.validation.json \
	--randomizations 100
```

## Predict

Receives an inputs, preprocess it using a preprocess object (generated using `bambu-preprocess`) and then runs a classification model (generated using `bambu-train`). Results are saved in a CSV file.

```
$ bambu-predict \
	--input pubchem_compounds.sdf \
	--preprocessor 29_preprocessor.pickle \
	--model 29_model.pickle \
	--output 29_predictions.csv
``` 

# Contact

Feel free to open issues or pull requests! You may also contact us by email.

Dr. Frederico Schmitt Kremer, PhD. E-mail: [fred.s.kremer@gmail.com](fred.s.kremer@gmail.com).


