Project 17: Comparative Analysis of Acquisition Functions in Bayesian Optimization for Drug Discovery¶

SUBMISSION REPORT

Official Repo | Short Video

Project leads¶

  • Suneel Kumar BVS (Molecular Forecaster). GitHub

Contributors¶

  • Jakub Lála (Imperial College London). GitHub
  • Luis Walter (Heidelberg University). GitHub
  • Curtis Chong (University of Waterloo). GitHub
  • Yunheng Zou (University of Waterloo). GitHub
  • Jan Christopher Spies (University of Muenster). GitHub

Abstract:¶

This project presents a comparative analysis of acquisition function methods and their effect on the efficiency of Bayesian Optimization (BO) in drug discovery, focusing in particular on small, diverse, unbalanced, and noisy datasets. The study evaluates the impact of different acquisition functions (AFs), molecular featurization methods, and applicability domain (AD) considerations across multiple drug discovery datasets to uncover optimal strategies and best practices for employing AFs effectively in drug discovery challenges.

Featurizations:¶

  • Multiple fingerprints are available for molecular featurization (e.g., ECFP/Morgan and atom-pair fingerprints; other techniques include MACCS keys, Topological Torsion, and PubChem fingerprints).

  • Each provides unique insight into molecular structure for computational analysis.

  • For the current study, we implemented ECFP fingerprints and explored different lengths (512, 1024, 2048) to evaluate their impact on the predictive performance of active learning models, seeking to balance representational detail against computational efficiency.

ECFP Fingerprints (Extended-Connectivity Fingerprints):¶

  • Represents the molecular structure as a binary string, capturing the presence of specific substructures. Commonly used in cheminformatics.
  • Radius: Defines the size of the atom's neighborhood considered when generating the fingerprint.

ecfp.jpg

  • Length (Bit Size): the length of the binary string.
      - 512 bits: a more compact representation, leading to faster computation and lower memory usage.
      - 1024 bits: the standard size, balancing resolution and efficiency; sufficient for small to medium-sized molecules.
      - 2048 bits: higher resolution than 1024 bits, capturing more detail and potentially improving similarity measures.
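The length trade-off above comes from bit folding: hashed substructure identifiers are mapped into a fixed-size bit string, so shorter strings collide more often. A toy sketch of this folding step (pure Python for illustration; a real pipeline would generate ECFPs with RDKit, e.g. `AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)`):

```python
import hashlib

def fold_to_bits(substructure_ids, n_bits):
    """Map each hashed substructure identifier to a bit position modulo n_bits.
    Distinct identifiers landing on the same position are collisions."""
    bits = set()
    for sub in substructure_ids:
        h = int(hashlib.md5(sub.encode()).hexdigest(), 16)
        bits.add(h % n_bits)
    return bits

# Stand-ins for circular atom environments up to radius 2 (hypothetical IDs).
environments = [f"env_{i}" for i in range(300)]

for n_bits in (512, 1024, 2048):
    on_bits = fold_to_bits(environments, n_bits)
    collisions = len(environments) - len(on_bits)
    print(f"{n_bits:5d} bits -> {len(on_bits)} set bits, {collisions} collisions")
```

Longer bit strings leave more positions available, so fewer environments share a bit and similarity measures become more discriminating, at the cost of memory and compute.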

Datasets:¶

  • We sourced our datasets from the Therapeutics Data Commons (TDC) and enhanced them through careful review and advanced featurization. These steps enrich the data, making it more valuable for developing precise and reliable predictive models. Here are the details of the datasets:

  • hERG Central (Source: TDC):

  • The Human ether-à-go-go related gene (hERG) is crucial for the coordination of the heart's beating.

  • This classification is based on whether hERG inhibition_at_10uM < -50, i.e., whether the compound has an IC50 of less than 10 µM.

Statistics¶

| Dataset         | Total Compounds | Number of 0s | Number of 1s |
|-----------------|-----------------|--------------|--------------|
| Original Data   | 306,893         | 293,149      | 13,744       |
| Cleaned Dataset | 288,787         | 275,880      | 12,907       |
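The thresholding rule above can be written down directly. A minimal sketch (the field name `inhibition_at_10uM` follows the wording above; the exact column name in the TDC release may differ):

```python
def herg_label(inhibition_at_10uM: float) -> int:
    """Binarize hERG Central activity: label a compound a blocker (1)
    when its inhibition at 10 uM is below -50, i.e. IC50 < 10 uM;
    otherwise label it 0."""
    return 1 if inhibition_at_10uM < -50 else 0

print(herg_label(-72.3), herg_label(-12.5))
```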

Datasets (continued):¶

  • AMES Mutagenicity (Source: TDC)
  • Mutagenicity refers to the capability of a substance to induce genetic alterations.
  • The goal is to predict whether a compound is mutagenic (1) or not mutagenic (0).

Statistics¶

| Dataset         | Total Compounds | Number of 0s | Number of 1s |
|-----------------|-----------------|--------------|--------------|
| Original Data   | 7278            | 3304         | 3974         |
| Cleaned Dataset | 3533            | 1463         | 2070         |

Datasets (continued):¶

  • Half-Life (Source: TDC): the half-life of a drug is the time it takes for the concentration of the drug in the body to be reduced by half.

Statistics¶

| Dataset         | Total Compounds |
|-----------------|-----------------|
| Original Data   | 667             |
| Cleaned Dataset | 489             |
  • Acute Toxicity LD50 (Source: TDC)
  • Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects.
  • The lower the LD50, the more toxic the compound.

Statistics¶

| Dataset         | Total Compounds |
|-----------------|-----------------|
| Original Data   | 7300            |
| Cleaned Dataset | 3400            |
  • We selected the LD50 data for the current study and implemented Random Forest and Gaussian Process models with 5 acquisition functions.

Dataset Distribution (need for cleaning):¶

dataset.png

  • For the current activity, we cleaned each dataset against a property threshold and used the filtered dataset for featurization. The datasets need further careful review, which we could not complete within the project timeline.

About the package:¶

Installation using individual packages:¶

To install the required packages, run the following commands:

pip install PyTDC==0.3.6
pip install xgboost
pip install torch
pip install gpytorch
pip install requests

Acquisition Function:¶

  • Greedy (predictive power): Predictive performance of the Greedy Acquisition Function across 20 iterations on the LD50 dataset, using ECFP with a length of 2048 and a radius of 2.

  • Normalized Acquisition Values (Green Line): The acquisition values are normalized, meaning they're scaled to a range, often [0, 1], for visualization purposes.

  • New Observations (Red Dots): These dots mark the points that have been selected for labeling in each iteration based on the acquisition values.
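The selection step described above (score the pool, normalize the acquisition values, mark the top candidates as new observations) can be sketched as follows. This is an illustrative sketch, not the project's exact code; the function names and the stand-in prediction arrays are hypothetical:

```python
import numpy as np

def normalize(values):
    """Min-max scale acquisition values to [0, 1] (the green line in the plots)."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)

def select_batch(mean, std, af="greedy", beta=2.0, k=5):
    """Score the unlabeled pool with an acquisition function and return the
    indices of the k top-scoring candidates ('new observations', red dots)."""
    if af == "greedy":          # pure exploitation: predicted value only
        scores = mean
    elif af == "ucb":           # Upper Confidence Bound: mean + beta * std
        scores = mean + beta * std
    elif af == "uncertainty":   # pure exploration: model uncertainty only
        scores = std
    else:
        raise ValueError(f"unknown acquisition function: {af}")
    scores = normalize(scores)
    return np.argsort(scores)[-k:][::-1]

# Stand-ins for model predictions over a 100-compound unlabeled pool:
rng = np.random.default_rng(0)
mean = rng.normal(size=100)        # e.g. ensemble mean prediction
std = rng.uniform(0.1, 1.0, 100)   # e.g. spread across ensemble members

print(select_batch(mean, std, af="ucb", k=5))
```

In an active learning run, the selected compounds would be labeled, added to the training set, and the model refit before the next of the 20 iterations.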

Acquisition Function:¶

  • Greedy (predictive power): Predictive performance of the Greedy Acquisition Function across 20 iterations on the LD50 dataset, using ECFP with a length of 2048 and a radius of 2.

GP_AF_Selection.png

Acquisition Function:¶

  • Predictive performance of the acquisition functions at the 6th iteration on the LD50 dataset, using ECFP with a length of 2048 and a radius of 2.

Greedy_AF.gif

Model Performance (Random Forest):¶

ld50_dataset_using_random_forest.png

  • Observations: The Expected Improvement and Greedy AFs show a notable increase in the number of predictions above the 90th percentile, with the Upper Confidence Bound having a slight edge.

Model Performance (Gaussian Process):¶

GP_results.png

  • Observations: The GP seems to struggle with the high-dimensional space (ECFP2048).
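One plausible reason the GP degrades on 2048-bit fingerprints is distance concentration: in high dimension, pairwise distances between random bit vectors cluster tightly around their mean, so a stationary kernel (e.g. RBF) sees all compounds as roughly equally similar. A small numpy sketch of this effect, under the assumption of random bit vectors and an RBF kernel (not the project's actual model):

```python
import numpy as np

def rbf(X1, X2, lengthscale):
    """Squared-exponential (RBF) kernel on raw bit vectors."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def offdiag_spread(n_bits, n_mols=50, seed=1):
    """Std of off-diagonal kernel entries for random n_bits fingerprints.
    The lengthscale is scaled with dimension so the *mean* similarity stays
    comparable across lengths; only the spread changes."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_mols, n_bits)).astype(float)
    K = rbf(X, X, lengthscale=np.sqrt(n_bits / 2.0))
    off = K[~np.eye(n_mols, dtype=bool)]
    return off.std()

for n_bits in (64, 512, 2048):
    print(n_bits, round(offdiag_spread(n_bits), 4))
```

As the spread of kernel values shrinks, the GP posterior has little signal to distinguish near from far neighbors, which is consistent with the flat GP performance observed here; tree ensembles like Random Forest are less sensitive to this.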

Impact of ECFP Length on AF and Model Performance:¶

ECFP_LENGHT_IMPACT.png

Summary:¶

  • We successfully explored two ECFP fingerprint lengths (ECFP1024 and ECFP2048) and implemented 5 different acquisition functions.

  • Random Forest performed well, whereas the Gaussian Process (GP) appeared to struggle with high-dimensional spaces.

  • We ran 20 iterations of each acquisition function and studied batch selection and model performance. Greedy and Upper Confidence Bound appear to cover the chemical space effectively (Random Forest).

  • Overall, Expected Improvement, Greedy, and Upper Confidence Bound outperform the other acquisition functions.

  • EI (Expected Improvement) outperforms in RF irrespective of fingerprint length (ECFP2048 and ECFP1024).

Future Plans:¶

  • We plan to explore more fingerprints and machine learning models to assess the impact of AFs on overall model performance and chemical space coverage.