Project 17: Comparative Analysis of Acquisition Functions in Bayesian Optimization for Drug Discovery

SUBMISSION REPORT

Project leads

  • Suneel Kumar BVS (Molecular Forecaster). GitHub

Contributors

  • Jakub Lála (Imperial College London). GitHub
  • Luis Walter (Heidelberg University). GitHub
  • Curtis Chong (University of Waterloo). GitHub
  • Yunheng Zou (University of Waterloo). GitHub
  • Jan Christopher Spies (University of Muenster). GitHub

Abstract:

This project compares acquisition functions (AFs) and their effect on the efficiency of Bayesian Optimization (BO) in drug discovery, focusing on small, diverse, unbalanced, and noisy datasets. The study evaluates the impact of different acquisition functions, molecular featurization methods, and applicability domain (AD) across multiple drug discovery datasets to identify effective strategies and best practices for using AFs in drug discovery challenges.
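
To make the term "acquisition function" concrete, here is a minimal sketch of two widely used ones, Expected Improvement (EI) and Upper Confidence Bound (UCB), for a maximization problem with a Gaussian posterior. These are illustrative choices, not necessarily the functions evaluated in this project; the candidate names, posterior values, and the `xi` and `beta` parameters are assumptions made for the example.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected Improvement for maximization under a Gaussian posterior."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf


def upper_confidence_bound(mu, sigma, beta=2.0):
    """Upper Confidence Bound: posterior mean plus a weighted uncertainty bonus."""
    return mu + beta * sigma


# Hypothetical candidates: name -> (posterior mean, posterior standard deviation).
candidates = {"mol_a": (0.70, 0.05), "mol_b": (0.60, 0.30), "mol_c": (0.72, 0.01)}
best_so_far = 0.71  # hypothetical best observed value

ei_scores = {n: expected_improvement(m, s, best_so_far) for n, (m, s) in candidates.items()}
ucb_scores = {n: upper_confidence_bound(m, s) for n, (m, s) in candidates.items()}

next_pick_ei = max(ei_scores, key=ei_scores.get)
next_pick_ucb = max(ucb_scores, key=ucb_scores.get)
print("EI picks:", next_pick_ei)
print("UCB picks:", next_pick_ucb)
```

In this toy setting both functions favor the candidate with the largest posterior uncertainty even though its mean is the lowest, which illustrates how an acquisition function trades off exploitation against exploration.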

Featurizations:

  • Many fingerprints are available for molecular featurization, including ECFP, Morgan, atom-pair, MACCS keys, Topological Torsion, and PubChem fingerprints, each capturing a different view of molecular structure for computational analysis.
  • For the current study, we implemented ECFP fingerprints and explored different lengths (512, 1024, and 2048 bits) to evaluate their impact on the predictive performance of active learning models, seeking to balance representational detail with computational efficiency.

ECFP (Extended-Connectivity Fingerprint):

  • Represents the molecular structure as a binary string, capturing the presence of specific substructures.
  • Commonly used in cheminformatics for drug discovery to compare molecular similarity.
  • Radius: Defines the size of the atom's neighborhood considered when generating the fingerprint.
  • The radius influences the specificity of the fingerprint; a small radius might be too local and miss important context, whereas a very large one might be less discriminative.
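
Similarity between two binary fingerprints is often measured with the Tanimoto (Jaccard) coefficient. The sketch below assumes fingerprints are represented as sets of on-bit indices; the two example fingerprints are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in this report.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention when both fingerprints are empty
    return len(a & b) / union


# Hypothetical on-bit indices of two fingerprints.
fp_1 = {3, 17, 42, 101, 511}
fp_2 = {3, 17, 99, 101}

print(tanimoto(fp_1, fp_2))  # 3 shared bits out of 6 distinct bits -> 0.5
```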

ecfp.jpg

  • Length (Bit Size): Refers to the length of the binary string in the fingerprint.
  • 512 bits: Offers a more compact representation, leading to faster computation and less memory usage.
  • A 512-bit fingerprint is more prone to collisions, where different substructures map to the same bit, which can reduce its discriminative power.
  • 1024 bits: Often used as a standard size, providing a balance between resolution and computational efficiency.
  • A 1024-bit fingerprint is sufficient for a wide range of small to medium-sized molecules.
  • 2048 bits: Provides a higher resolution than 1024 bits, capturing more detail and potentially improving similarity measures.
  • A 2048-bit fingerprint is useful for larger and more complex molecules, where a 1024-bit fingerprint may have collisions (different substructures hashing to the same bit).
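
The effect of fingerprint length on collisions can be illustrated without any cheminformatics library. The sketch below is a simplified stand-in for ECFP folding, not the actual ECFP algorithm: it hashes hypothetical substructure identifiers with SHA-256 and folds them into a fixed number of bits with a modulo. The identifier strings and the count of 200 are assumptions for illustration.

```python
import hashlib


def fold_to_bits(identifiers, n_bits):
    """Hash each identifier and fold it into one of n_bits positions."""
    on_bits = set()
    for ident in identifiers:
        digest = hashlib.sha256(ident.encode("utf-8")).digest()
        on_bits.add(int.from_bytes(digest[:8], "big") % n_bits)
    return on_bits


def count_collisions(identifiers, n_bits):
    """Number of distinct identifiers lost because they share a bit with another."""
    unique = set(identifiers)
    return len(unique) - len(fold_to_bits(unique, n_bits))


# Hypothetical substructure identifiers standing in for hashed atom environments.
substructures = [f"substructure_{i}" for i in range(200)]

collisions = {n: count_collisions(substructures, n) for n in (512, 1024, 2048)}
for n_bits, lost in collisions.items():
    print(f"{n_bits} bits: {lost} collisions")
```

Because each length here is a power of two and folding is a plain modulo, any two identifiers that collide at 1024 bits also collide at 512 bits, so the shorter fingerprint can never have fewer collisions in this sketch.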

Datasets:

  • We sourced three datasets from the Therapeutics Data Commons (TDC) and refined them through careful review and featurization, with the aim of making the data more useful for building precise and reliable predictive models. Details of the datasets follow:

  • hERG Central (Source: TDC):

  • The human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating.

  • The classification label is based on whether hERG inhibition_at_10uM < -50, i.e., whether the compound has an IC50 of less than 10 µM.
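
The thresholding rule above can be written as a one-line labeling function. This is a minimal sketch: the function name, the example readouts, and the assumption that label 1 marks compounds meeting the condition are illustrative and not taken from the project code.

```python
def herg_label(inhibition_at_10um, threshold=-50.0):
    """Return 1 if the inhibition readout is below the threshold, else 0."""
    return 1 if inhibition_at_10um < threshold else 0


# Hypothetical inhibition_at_10uM readouts for four compounds.
readouts = [-72.3, -50.0, -12.5, 4.1]
labels = [herg_label(value) for value in readouts]
print(labels)  # [1, 0, 0, 0]; -50.0 itself does not satisfy the strict inequality
```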

Dataset Statistics

| Dataset | Total Compounds | Number of 0s | Number of 1s |
|---|---|---|---|
| Original Data | 306,893 | 293,149 | 13,744 |
| Cleaned Dataset | 288,787 | 275,880 | 12,907 |
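
The class imbalance in the hERG data can be quantified directly from the counts in the table. The sketch below only reuses those counts; the function and variable names are illustrative.

```python
# Counts copied from the hERG Central statistics table.
herg_counts = {
    "Original Data": {"zeros": 293_149, "ones": 13_744},
    "Cleaned Dataset": {"zeros": 275_880, "ones": 12_907},
}


def positive_fraction(zeros, ones):
    """Fraction of compounds carrying label 1."""
    return ones / (zeros + ones)


fractions = {name: positive_fraction(**c) for name, c in herg_counts.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} labeled 1")
```

Both versions of the dataset have fewer than 5% of compounds labeled 1, which is the kind of imbalance the abstract refers to.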

  • AMES Mutagenicity (Source: TDC):
  • Mutagenicity refers to the capability of a substance to induce genetic alterations.
  • The goal is to predict whether a compound is mutagenic (1) or not mutagenic (0).

Dataset Statistics

| Dataset | Total Compounds | Number of 0s | Number of 1s |
|---|---|---|---|
| Original Data | 7,278 | 3,304 | 3,974 |
| Cleaned Dataset | 3,533 | 1,463 | 2,070 |

  • Half-life (Source: TDC): The half-life of a drug is the time it takes for the concentration of the drug in the body to be reduced by half.
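
As a worked illustration of this definition, the sketch below computes the fraction of a drug remaining after a given time. It assumes simple first-order (exponential) elimination, which is a common textbook model and not something this report specifies; the 6-hour half-life is a made-up value.

```python
def remaining_fraction(elapsed, half_life):
    """Fraction of the initial concentration left after `elapsed` time units,
    assuming first-order elimination (an assumption of this sketch)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (elapsed / half_life)


# Hypothetical drug with a 6-hour half-life.
for hours in (0, 6, 12, 18):
    print(hours, remaining_fraction(hours, half_life=6.0))
```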

Dataset Statistics

| Dataset Description | Total Compounds |
|---|---|
| Original Data | 667 |
| Cleaned Dataset | 489 |

  • Acute Toxicity LD50:
  • Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects.
  • The higher the administered dose, the more lethal the drug; a lower LD50 therefore indicates a more toxic compound.

Dataset Statistics

| Dataset Description | Total Compounds |
|---|---|
| Original Data | 667 |
| Cleaned Dataset | 489 |

About the package:

Installation using individual packages:

To install the required packages individually, run the following commands:

pip install PyTDC==0.3.6
pip install xgboost
pip install torch
pip install gpytorch
pip install requests
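
After installation, a quick check like the one below can confirm that the packages are visible to the current Python environment. This is a minimal sketch using only the standard library; the mapping from pip names to import names (in particular PyTDC being imported as `tdc`) is an assumption of this example.

```python
import importlib.util

# pip package name -> assumed import name
REQUIRED = {
    "PyTDC": "tdc",
    "xgboost": "xgboost",
    "torch": "torch",
    "gpytorch": "gpytorch",
    "requests": "requests",
}


def check_installed(required):
    """Map each pip package name to True if its module can be located."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in required.items()
    }


status = check_installed(REQUIRED)
for pip_name, found in status.items():
    print(f"{pip_name}: {'found' if found else 'missing'}")
```

The check only locates the modules and does not import them, so it stays fast even when large packages such as torch are installed.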