Metadata-Version: 2.1
Name: molbar
Version: 1.1.0
Summary: Molecular Barcode (MolBar): Molecular Identifier for Organic and Inorganic Molecules
Home-page: https://git.rwth-aachen.de/bannwarthlab/MolBar
Author: Nils van Staalduinen
Author-email: Nils van Staalduinen <van.staalduinen@pc.rwth-aachen.de>, Christoph Bannwarth <bannwarth@pc.rwth-aachen.de>
Maintainer-email: Nils van Staalduinen <van.staalduinen@pc.rwth-aachen.de>
License: MIT License
        
        Copyright (c) 2022 Nils van Staalduinen, Christoph Bannwarth
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://git.rwth-aachen.de/bannwarthlab/molbar/
Project-URL: Documentation, https://git.rwth-aachen.de/bannwarthlab/molbar/
Project-URL: Repository, https://git.rwth-aachen.de:bannwarthlab/molbar.git
Project-URL: Issues, https://git.rwth-aachen.de/bannwarthlab/molbar/-/issues
Project-URL: Changelog, https://git.rwth-aachen.de/bannwarthlab/molbar/-/issues
Keywords: molecular identifier,chemical data science,stereoisomerism
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: networkx ==3.1
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: joblib
Requires-Dist: numba
Requires-Dist: ase
Requires-Dist: dscribe
Requires-Dist: numpy >=1.21
Requires-Dist: pyyaml

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# MolBar

<div align="center">
<img src="logo.png" alt="logo" width="400" />
</div>

This package provides an implementation of the Molecular Barcode (MolBar) as a quantum chemistry-inspired molecular identifier to ensure data uniqueness in databases, supporting organic and inorganic molecules while attempting to describe relative and absolute configuration including centric, axial/helical and planar chirality.

It does this by fragmenting a molecule into rigid parts which are then idealized with a specialized non-physical force field. The molecule is then described by different matrices encoding topology (connectivity), topography (3D positions of atoms after input unification), and absolute configuration (by calculating a chirality index). The final barcode is the concatenated spectra of these matrices.

## Current Limitations 

So far, it should work well for organic and inorganic molecules with typical 2c2e bonding. It can describe molecules based on their relative and absolute configuration, including centric, axial/helical and planar chirality.

As the usual starting point is a set of 3D Cartesian coordinates, problems can currently occur if it is not easy to determine which atoms are bonded, especially for metal complexes with η-bonds. Problems can also occur if the geometry around a metal in a complex cannot be classified by one of the standard VSEPR models. If you are not sure, use the -d option when running MolBar as a commandline tool, or set write_trj=True when using MolBar as a Python module, to look at the optimized trajectories of each fragment. If something is unclear or something unusual happens, I would appreciate it if you reported it by posting an issue or by e-mail (van.staalduinen@pc.rwth-aachen.de).

For rigidity analysis, MolBar only considers double/triple bonds and rings to be rigid. For example, hindered rotation caused by bulky substituents is not taken into account. It can be added manually from the input file as an additional dihedral constraint, but this should be done only in exceptional cases and with care.

So far, .xyz, .coord as well as .mol/.sdf containing 3D coordinates are supported. Support for .mol/.sdf files with 2D coordinates is coming soon.
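
As a small illustration of the format list above, the following sketch filters file paths by extension before they are handed to MolBar. The helper name `filter_supported` and the `SUPPORTED_EXTENSIONS` set are assumptions made for this example and are not part of the MolBar API. The check looks only at the file name, so it cannot tell whether a .mol/.sdf file holds 2D or 3D coordinates.

```python
from pathlib import Path

# Extensions named in this README. The set is an assumption of this sketch,
# not a constant exported by MolBar.
SUPPORTED_EXTENSIONS = {".xyz", ".coord", ".mol", ".sdf"}


def filter_supported(paths):
    """Return only the paths whose extension is listed as supported."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]


candidates = ["a.xyz", "b.pdb", "c.SDF", "d.coord"]
print(filter_supported(candidates))
```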

## Getting started (tested on Linux and macOS, compiling works for Windows only in WSL)

### For Linux/macOS

Using a virtual environment is highly recommended because it allows you to create isolated environments with their own dependencies, without interfering with other Python projects or the system Python installation. This ensures that your Python environment remains consistent and reproducible across different machines and over time. To create one, type in the following command in your terminal:

```bash
 python3 -m venv path/to/venv
```
To activate the environment, type in:
```bash
 source path/to/venv/bin/activate
```
To install MolBar, enter the following command in your terminal:

```bash
pip install molbar
```

### For Windows

Since compiling in a standard Windows environment does not work yet, it is highly recommended to use the WSL (Windows Subsystem for Linux) extension. Simply follow this installation guide: https://learn.microsoft.com/en-us/windows/wsl/install. Note that a Fortran compiler needs to be installed manually in the WSL environment. Otherwise, the installation of MolBar will result in an error.

For Python usage, it is highly recommended to use Visual Studio Code (VSC) as it provides specific extensions to code directly in WSL. A more detailed guide can be found here: https://code.visualstudio.com/docs/remote/wsl


# MolBar Structure

For L-alanine, the MolBar reads:

```text

MolBar | 1.0.0 | C3NO2H7 | -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | -209 -8 130 160 354 633 | -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | -31 0 0 0 11

```

MolBar is constructed as follows: 

```text
Version: 1.0.0
Molecular Formula: C3H7NO2 
Topology Spectrum: -339 -140 -110 -32 13 20 20 20 160 237 432 528 850 (Encoding atomic connectivity)
Heavy Atom Topology Spectrum: -209 -8 130 160 354 633 (Encoding atomic connectivity without hydrogen atoms. If two molecules have different topology spectra but identical heavy atom topology spectra, they are tautomers of each other)
Topography Spectrum : -154 -117 -67 -40 9 9 9 60 74 156 342 457 922 (3D arrangement of atoms in Cartesian space, also describes diastereomerism)
Absolute Configuration: -31 0 0 0 11 (Encoding absolute configuration for each fragment)
```
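
The layout described above can be split back into its parts with plain string handling. The following is a minimal sketch, assuming a full barcode (mode "mb") with exactly the seven `|`-separated fields shown for L-alanine. The names `ParsedMolBar`, `parse_molbar` and `may_be_tautomers` are hypothetical helpers for this example and are not part of the MolBar package.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedMolBar:
    version: str
    formula: str
    topology: List[int]
    heavy_atom_topology: List[int]
    topography: List[int]
    absolute_configuration: List[int]


def parse_molbar(barcode: str) -> ParsedMolBar:
    """Split a full MolBar string into version, formula and the four spectra."""
    fields = [field.strip() for field in barcode.strip().split("|")]
    if len(fields) != 7 or fields[0] != "MolBar":
        raise ValueError("expected 'MolBar' followed by six '|'-separated fields")
    spectra = [[int(value) for value in field.split()] for field in fields[3:]]
    return ParsedMolBar(fields[1], fields[2], *spectra)


def may_be_tautomers(a: ParsedMolBar, b: ParsedMolBar) -> bool:
    """Apply the rule stated above: same heavy atom topology, different topology."""
    return a.heavy_atom_topology == b.heavy_atom_topology and a.topology != b.topology


# The L-alanine barcode quoted in this README.
ALANINE = (
    "MolBar | 1.0.0 | C3NO2H7 | "
    "-339 -140 -110 -32 13 20 20 20 160 237 432 528 850 | "
    "-209 -8 130 160 354 633 | "
    "-154 -117 -67 -40 9 9 9 60 74 156 342 457 922 | "
    "-31 0 0 0 11"
)

parsed = parse_molbar(ALANINE)
print(parsed.formula, parsed.absolute_configuration)
```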

## Python Module Usage

MolBar can be generated by Python function calls:

1. for a single molecule with ```get_molbar_from_coordinates``` by specifying the Cartesian coordinates as a list,
2. for several molecules at once with ```get_molbars_from_coordinates``` by giving a list of lists with Cartesian coordinates,
3. for a single molecule with ```get_molbar_from_file``` by specifying a file path,
4. for several molecules at once with ```get_molbars_from_files``` by specifying a list of file paths.

### 1. get_molbar_from_coordinates

```python
  from molbar.barcode import get_molbar_from_coordinates

  def get_molbar_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None, mode="mb") -> Union[str, dict]:

      Args:

          coordinates (list): Molecular geometry provided by atomic Cartesian coordinates with shape (n_atoms, 3).
          elements (list): A list of elements in that molecule.
          return_data (bool): Whether to return MolBar data.
          timing (bool): Whether to print the duration of this calculation.
          input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
          mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").

      Returns:

          Union[str, dict]: Either MolBar or the MolBar and MolBar data.

  ```
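
To show what the `coordinates` and `elements` arguments look like, here is a minimal sketch that turns XYZ-formatted text into the two lists. The `read_xyz` helper and the water geometry are illustrative assumptions, not part of MolBar. For files on disk, `get_molbar_from_file` reads the structure directly and no such helper is needed.

```python
def read_xyz(text):
    """Parse XYZ text into (coordinates, elements).

    coordinates has shape (n_atoms, 3); elements holds one symbol per atom.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])
    atom_lines = lines[2:2 + n_atoms]  # line 2 of an XYZ file is a comment
    if len(atom_lines) != n_atoms:
        raise ValueError("fewer atom lines than announced in the header")
    coordinates, elements = [], []
    for line in atom_lines:
        parts = line.split()
        elements.append(parts[0])
        coordinates.append([float(value) for value in parts[1:4]])
    return coordinates, elements


# Illustrative water geometry in angstrom (made up for this example).
WATER_XYZ = """3
water
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200
"""

coordinates, elements = read_xyz(WATER_XYZ)
print(elements, coordinates)
```

With MolBar installed, the two lists can be passed on as `get_molbar_from_coordinates(coordinates, elements)`, following the signature documented above.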

Example for input constraints as a Python dict. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).

  ```python
  {
  'constraints': {
                  'dihedrals': [{'atoms': [1,2,3,4], 'value':90.0},...]} #atoms: list of atoms that define the dihedral, value is the ideal dihedral angle in degrees, atom indexing starts with 1.
  }
  ```
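
A dict of this shape can also be built programmatically. The following is a minimal sketch that creates the structure shown above for a single dihedral and checks the atom list first. `make_dihedral_constraint` is a hypothetical helper for this example and not a MolBar function.

```python
def make_dihedral_constraint(atoms, value):
    """Build an input_constraint dict for one dihedral (atom indices start at 1)."""
    atoms = list(atoms)
    if len(atoms) != 4 or len(set(atoms)) != 4:
        raise ValueError("a dihedral is defined by four distinct atoms")
    if any(not isinstance(atom, int) or atom < 1 for atom in atoms):
        raise ValueError("atom indices must be integers starting at 1")
    # value is the ideal dihedral angle in degrees
    return {"constraints": {"dihedrals": [{"atoms": atoms, "value": float(value)}]}}


constraint = make_dihedral_constraint([1, 2, 3, 4], 90.0)
print(constraint)
```

The resulting dict has the layout expected by the `input_constraint` argument documented above.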


### 2. get_molbars_from_coordinates
NOTE:
If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.

  ```python
  from molbar.barcode import get_molbars_from_coordinates

  def get_molbars_from_coordinates(list_of_coordinates: list, list_of_elements: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False,  mode="mb") -> Union[list, Union[str, dict]]:

      Args:

          list_of_coordinates (list): A list of molecular geometries provided by atomic Cartesian coordinates with shape (n_molecules, n_atoms, 3).
          list_of_elements (list): A list of element lists for each molecule in the list_of_coordinates with shape (n_molecules, n_atoms).
          return_data (bool): Whether to return MolBar data.
          threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
          timing (bool):  Whether to print the duration of this calculation.
          input_constraints (list, optional): A list of constraints for the calculation. Each constraint in that list is a Python dict as shown above for get_molbar_from_coordinates.
          progress (bool): Whether to show a progress bar.
          mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").

      Returns:

          Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.
```

### 3. get_molbar_from_file

  ```python
  from molbar.barcode import get_molbar_from_file

  def get_molbar_from_file(file: str, return_data=False, timing=False, input_constraint=None, mode="mb", write_trj=False) -> Union[str, dict]:

      Args:
          file (str): The path to the file containing the molecule information (either .xyz/.sdf/.mol format).
          return_data (bool): Whether to return MolBar data.
          timing (bool): Whether to print the duration of this calculation.
          input_constraint (dict, optional): A dict of extra constraints for the calculation. See below for more information. USED ONLY IN EXCEPTIONAL CASES.
          mode (str): Whether to calculate the molecular barcode ("mb") or only the topology part of the molecular barcode ("topo").
          write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
      
      Returns:

          Union[str, dict]: Either MolBar or the MolBar and MolBar data.

  ```

Example for an input file in .yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).
```yml
constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees
```


### 4. get_molbars_from_files

NOTE:
If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.

  ```python
  from molbar.barcode import get_molbars_from_files

  def get_molbars_from_files(files: list, return_data=False, threads=1, timing=False, input_constraints=None, progress=False, mode="mb", write_trj=False) ->Union[list, Union[str, dict]]:

      Args:

          files (list): The list of paths to the files containing the molecule information (either .xyz/.sdf/.mol format).
          return_data (bool): Whether to return MolBar data.
          threads (int): Number of threads to use for the calculation. If you need to process multiple molecules at once, it is recommended to use this function and specify the number of threads that can be used to process multiple molecules simultaneously.
          timing (bool):  Whether to print the duration of this calculation.
          input_constraints (list, optional): A list of file paths to the input files for the calculation. Each constraint is specified by a file path to a .yml file, as shown above for get_molbar_from_file.
          progress (bool): Whether to show a progress bar.
          mode (str): Whether to calculate the molecular barcode ("mb") or the topology part of the molecular barcode ("topo").
          write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.

      Returns:

          Union[list, Union[str, dict]]: Either MolBar or the MolBar and MolBar data.

  ```
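
Since MolBar is meant to ensure data uniqueness, a typical follow-up to a batch call is to group inputs that share the same barcode. The sketch below assumes that the batch functions return one MolBar string per input, in input order, when `return_data=False`. The helpers `group_by_molbar` and `find_duplicates` are hypothetical, and the barcodes in the example are placeholders rather than real MolBar output.

```python
from collections import defaultdict


def group_by_molbar(labels, molbars):
    """Map each barcode to the list of labels (e.g. file paths) that produced it."""
    if len(labels) != len(molbars):
        raise ValueError("labels and molbars must have the same length")
    groups = defaultdict(list)
    for label, barcode in zip(labels, molbars):
        groups[barcode].append(label)
    return dict(groups)


def find_duplicates(labels, molbars):
    """Return the groups of labels that share an identical barcode."""
    return [group for group in group_by_molbar(labels, molbars).values() if len(group) > 1]


# Placeholder barcodes, standing in for the list returned by a batch call.
files = ["a.xyz", "b.xyz", "c.xyz"]
barcodes = ["barcode-1", "barcode-2", "barcode-1"]
print(find_duplicates(files, barcodes))
```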


## Commandline Usage

MolBar can also be used as a commandline tool. Simply type:

```
molbar coord.xyz
```
and the MolBar is printed to stdout.

NOTE:
If you need to process several molecules at once, it is recommended to pass all molecules to the code at once (e.g. with *.xyz) while specifying the number of threads the code should use:
```bash
molbar *.xyz -T N_threads -s
```
The latter option (-s) is used to store the barcode to .mb files. 


Further, the commandline tool provides several options:

```text
usage: molbar [-h] [-r] [-i INP [INP ...]] [-d] [-T THREADS] [-s] [-t] [-p] [-m {mb,topo,opt}] files [files ...]

positional arguments:
  files                 file(s)

options:
  -m {mb,topo,opt}, --mode {mb,topo,opt}
                      The mode to use for the calculations (either "mb" (default, calculates MolBar), "topo" (topology part only)
                      or "opt" (using stand-alone force field idealization, writes ".opt" with final structure))

  -i INP [INP ...], --inp INP [INP ...]
                        Path to input file in .yml format to add further constraints. Example input can be found below.

  -d, --data           Whether to print MolBar data. 
                        Writes a "filename/" directory containing a json file with
                        important information that defines MolBar. Writes idealization trajectories of each fragment to same directory.

  -T THREADS, --threads THREADS
                        The number of threads to use for parallel processing of several files. MolBar generation for a single file is not parallelized. Should be used together with -s/--save (e.g. molbar *.xyz -T 8 -s)

  -s, --save            Whether to save the result to a file of type "filename.mb"
  -t, --time            Print out timings.

  -p, --progress        Use a progress bar when several files are handled.
```

Example for input file constraints in yml format. Input constraints should be used only in exceptional cases. However, it may be useful to add a dihedral constraint for bonds that are normally considered single bonds but whose rotation is hindered (e.g., 90° binol systems with bulky substituents).

```yml
constraints:
  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees
```
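
When the commandline tool is driven from a script, the options listed above can be assembled into an argument list. The following is a minimal sketch that uses only the flags shown in the usage text. `build_molbar_command` is a hypothetical helper, and the placement of the files before the options follows the `molbar *.xyz -T N_threads -s` example above.

```python
def build_molbar_command(files, threads=None, save=False, mode=None,
                         inp=None, data=False, timing=False, progress=False):
    """Assemble an argument list for the molbar commandline tool."""
    command = ["molbar", *[str(path) for path in files]]
    if mode is not None:
        if mode not in ("mb", "topo", "opt"):
            raise ValueError("mode must be 'mb', 'topo' or 'opt'")
        command += ["-m", mode]
    if threads is not None:
        command += ["-T", str(threads)]
    if save:
        command.append("-s")
    if data:
        command.append("-d")
    if timing:
        command.append("-t")
    if progress:
        command.append("-p")
    if inp:
        # -i accepts one or more .yml files, so it is placed last
        command += ["-i", *[str(path) for path in inp]]
    return command


print(build_molbar_command(["a.xyz", "b.xyz"], threads=8, save=True))
```

With MolBar installed, the list can be handed to `subprocess.run(command, check=True)`. Note that shell patterns such as `*.xyz` are not expanded in that case, so the file list has to be collected beforehand, for example with `glob.glob("*.xyz")`.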



## Using the unification force field for the whole molecule

The force field can also be used to idealize the structure of a whole molecule, with the coordinates given either as a file or, in Python, as a list of Cartesian coordinates:

1. as a commandline tool with the ```molbar coord.xyz -m opt``` option
2. in Python with ```idealize_structure_from_file``` by providing a file path
3. in Python with ```idealize_structure_from_coordinates``` by providing Cartesian coordinates as a list
 

### Commandline tool
```text
molbar coord.xyz -m opt
```
This writes a coord.opt file that contains the idealized coordinates.

### In Python from a file:
```python
  from molbar.barcode import idealize_structure_from_file

  def idealize_structure_from_file(file: str, return_data=False, timing=False, input_constraint=None, write_trj=False) -> Union[list, str]:

      Args:

          file (str): The path to the input file to be processed.
          return_data (bool): Whether to return MolBar data.
          timing (bool): Whether to print the duration of this calculation.
          input_constraint (str): The path to the input file containing the constraint for the calculation. See below for more information.
          write_trj (bool, optional): Whether to write a trajectory of the unification process. Defaults to False.
      Returns:
          n_atoms (int): Number of atoms in the molecule.
          energy (float): Final energy of the molecule after idealization.
          coordinates (list): Final coordinates of the molecule after idealization.
          elements (list): Elements of the molecule.
          data (dict): MolBar data.
```

This is an example input as a yml file:
```yml
bond_order_assignment: False  # False if bond order assignment should be skipped, only reasonable in opt mode (standalone force-field optimization).
cycle_detection: True  # False if cycle detection should be skipped, only reasonable in opt mode (standalone force-field optimization).
repulsion_charge: 100.0  # Charge used for the Coulomb term in the force field, every atom-atom pair uses the same charge, only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0.
set_edges: True  # False if no bonds should be constrained automatically.
set_angles: True  # False if no angles should be constrained automatically.
set_dihedrals: True  # False if no dihedrals should be constrained automatically.
set_repulsion: True  # False if no Coulomb term should be used automatically.

constraints:
  bonds:
    - atoms: [19, 23]  # List of atoms involved in the bond
      value: 1.5  # Ideal bond length. 
  angles:
    - atoms: [19, 23, 35]  # List of atoms involved in the angle
      value: 45.0  # Angle to which the angle between the three atoms is to be constrained
    - atoms: [35, 23, 19]  # List of atoms involved in the angle
      value: 45.0  # Angle to which the angle between the three atoms is to be constrained

  dihedrals:
    - atoms: [30, 18, 14, 13]  # List of atoms involved in the dihedral
      value:  90.0  # Ideal dihedral angle in degrees
```
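
Because a wrong atom count or a zero-based index is easy to overlook in such an input, it can help to check the `constraints` section before starting a calculation. The following is a minimal sketch, assuming the input has already been loaded into a Python dict with the layout shown above. `check_constraints` and `EXPECTED_ATOM_COUNTS` are hypothetical names for this example and not part of MolBar.

```python
# Number of atoms that define each constraint type.
EXPECTED_ATOM_COUNTS = {"bonds": 2, "angles": 3, "dihedrals": 4}


def check_constraints(config):
    """Return a list of human-readable problems found in config['constraints']."""
    problems = []
    for kind, entries in config.get("constraints", {}).items():
        if kind not in EXPECTED_ATOM_COUNTS:
            problems.append(f"unknown constraint type: {kind}")
            continue
        expected = EXPECTED_ATOM_COUNTS[kind]
        for position, entry in enumerate(entries, start=1):
            atoms = entry.get("atoms", [])
            if len(atoms) != expected:
                problems.append(f"{kind}[{position}]: expected {expected} atoms, got {len(atoms)}")
            if any(not isinstance(atom, int) or atom < 1 for atom in atoms):
                problems.append(f"{kind}[{position}]: atom indices must be integers starting at 1")
            if not isinstance(entry.get("value"), (int, float)):
                problems.append(f"{kind}[{position}]: value must be a number")
    return problems


# Same constraints as in the yml example above, written as a dict.
EXAMPLE = {
    "constraints": {
        "bonds": [{"atoms": [19, 23], "value": 1.5}],
        "angles": [{"atoms": [19, 23, 35], "value": 45.0}],
        "dihedrals": [{"atoms": [30, 18, 14, 13], "value": 90.0}],
    }
}
print(check_constraints(EXAMPLE))
```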

### In Python from a list of Cartesian coordinates:
```python
from molbar.barcode import idealize_structure_from_coordinates

def idealize_structure_from_coordinates(coordinates: list, elements: list, return_data=False, timing=False, input_constraint=None) -> Union[list, str]:

      Args:
          coordinates (list): Cartesian coordinates of the molecule.
          elements (list): Elements of the molecule.
          return_data (bool, optional): Whether to return MolBar data.
          timing (bool, optional): Whether to print the duration of this calculation.
          input_constraint (dict, optional): The constraint for the calculation. See documentation for more information.
          
      Returns:
          n_atoms (int): Number of atoms in the molecule.
          energy (float): Final energy of the molecule after idealization.
          coordinates (list): Final coordinates of the molecule after idealization.
          elements (list): Elements of the molecule.
          data (dict): MolBar data.
```

This is an example input as a Python dict:

```python
  {'bond_order_assignment': True, # False if bond order assignment should be skipped, only reasonable in opt mode (standalone force-field optimization).
  'cycle_detection': True, # False if cycle detection should be skipped, only reasonable in opt mode (standalone force-field optimization).
  'set_edges': True, # False if no bonds should be constrained automatically.
  'set_angles': True, # False if no angles should be constrained automatically.
  'set_dihedrals': True, # False if no dihedrals should be constrained automatically.
  'set_repulsion': True, # False if no Coulomb term should be used automatically.
  'repulsion_charge': 100.0, # Charge used for the Coulomb term in the force field, every atom-atom pair uses the same charge, only reasonable in opt mode (standalone force-field optimization). Defaults to 100.0.
  'constraints': {'bonds': [{'atoms': [1,2], 'value':1.5},...], #atoms: list of atoms that define the bond, value is the ideal bond length in angstrom, atom indexing starts with 1.
                  'angles': [{'atoms': [1,2,3], 'value':90.0},...], #atoms: list of atoms that define the angle, value is the ideal angle in degrees, atom indexing starts with 1.
                  'dihedrals': [{'atoms': [1,2,3,4], 'value':180.0},...]}  #atoms: list of atoms that define the dihedral, value is the ideal dihedral angle in degrees, atom indexing starts with 1.
  }
```
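
Both idealization functions are documented above as returning the elements and the final coordinates. To inspect the idealized structure in a molecular viewer, these can be written back to XYZ text. The sketch below is a plain formatting helper: `to_xyz` is a hypothetical name, and the two-atom input is placeholder data, not MolBar output. Whether the return values arrive as a tuple in the documented order is an assumption that should be checked against the installed version.

```python
def to_xyz(elements, coordinates, comment=""):
    """Format elements and (n_atoms, 3) coordinates as XYZ text."""
    if len(elements) != len(coordinates):
        raise ValueError("elements and coordinates must have the same length")
    lines = [str(len(elements)), comment]
    for element, (x, y, z) in zip(elements, coordinates):
        lines.append(f"{element:<2} {x:14.8f} {y:14.8f} {z:14.8f}")
    return "\n".join(lines) + "\n"


# Placeholder data standing in for the elements and coordinates of a result.
print(to_xyz(["O", "H"], [[0.0, 0.0, 0.0], [0.0, 0.0, 0.96]], comment="demo"))
```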

## Acknowledgements

MolBar relies on the following libraries
and packages:

*   [networkx](https://networkx.org/)
*   [NumPy](https://numpy.org)
*   [SciPy](https://scipy.org)
*   [tqdm](https://github.com/tqdm/tqdm)
*   [joblib](https://joblib.readthedocs.io/en/latest/)

Thank you!

## License and Disclaimer

MIT License

Copyright (c) 2022 Nils van Staalduinen, Christoph Bannwarth

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
