Abstract:¶
This project compares acquisition functions (AFs) and their effect on the efficiency of Bayesian Optimization (BO) in drug discovery, focusing on small, diverse, unbalanced, and noisy datasets. The study evaluates the impact of different acquisition functions, molecular featurization methods, and applicability domain (AD) across multiple drug discovery datasets, with the aim of identifying effective strategies and best practices for applying AFs to drug discovery problems.
Featurizations:¶
Multiple fingerprints are available for molecular featurization, including ECFP, Morgan, atom-pair, MACCS keys, Topological Torsion, and PubChem fingerprints. Each captures a different view of molecular structure for computational analysis.
For the current study, we implemented ECFP fingerprints and explored different lengths (512, 1024, 2048) to evaluate their impact on the predictive performance of active learning models, seeking to balance representational detail with computational efficiency.
ECFP (Extended-Connectivity Fingerprint):¶
- Represents the molecular structure as a binary string, capturing the presence of specific substructures. Commonly used in cheminformatics.
- Radius: Defines the size of the atom's neighborhood considered when generating the fingerprint.
- Length (Bit Size): The length of the binary string.
  - 512 bits: A more compact representation, leading to faster computation and lower memory usage.
  - 1024 bits: A standard size that balances resolution and efficiency. Sufficient for small to medium-sized molecules.
  - 2048 bits: Higher resolution than 1024 bits, capturing more detail and potentially improving similarity measures.
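The trade-off between bit lengths can be illustrated with a toy sketch. It is not the ECFP algorithm: the integer identifiers below are made-up stand-ins for hashed atom environments, and `fold_to_bits` and `count_collisions` are hypothetical helpers. It only shows that folding the same identifiers into fewer bits can lose more of them to collisions.

```python
import random

def fold_to_bits(identifiers, n_bits):
    """Fold integer substructure identifiers into a set of 'on' bit positions."""
    return {ident % n_bits for ident in identifiers}

def count_collisions(identifiers, n_bits):
    """Number of distinct identifiers lost because they share a bit with another."""
    unique = set(identifiers)
    return len(unique) - len(fold_to_bits(unique, n_bits))

# Hypothetical molecule: 60 distinct hashed environments (32-bit integers).
rng = random.Random(0)
identifiers = rng.sample(range(2**32), 60)

collisions = {n: count_collisions(identifiers, n) for n in (512, 1024, 2048)}
for n_bits, lost in collisions.items():
    print(f"{n_bits} bits: {lost} identifiers collided")
```

Because each length here divides the next, any two identifiers that collide at 2048 bits also collide at 1024 and 512 bits, so the collision count in this sketch never increases as the fingerprint gets longer.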
Datasets:¶
We sourced three datasets from the Therapeutics Data Commons (TDC), then cleaned and featurized them to make them better suited for building reliable predictive models. Details of each dataset follow:
- hERG Central (Source: TDC)
- The human ether-à-go-go related gene (hERG) is crucial for coordinating the heart's beating.
- The classification label is based on whether hERG inhibition_at_10uM < -50, i.e., whether the compound has an IC50 of less than 10 µM.
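The labeling rule described above can be sketched as a simple threshold. The function name `herg_label` and the example readings are hypothetical; only the `< -50` cutoff comes from the dataset description.

```python
def herg_label(inhibition_at_10um, threshold=-50.0):
    """Return 1 if the measured value is strictly below the threshold, else 0."""
    return 1 if inhibition_at_10um < threshold else 0

# Hypothetical measurements, for illustration only.
readings = [-72.3, -50.0, -12.5, 4.1]
labels = [herg_label(r) for r in readings]
print(labels)
```

Note that a reading of exactly -50 is labeled 0 here, because the rule uses a strict inequality.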
Statistics¶
Dataset | Total Compounds | Non-inhibitors (0) | Inhibitors (1) |
---|---|---|---|
Original Data | 306,893 | 293,149 | 13,744 |
Cleaned Dataset | 288,787 | 275,880 | 12,907 |
Datasets:¶
- AMES Mutagenicity (Source: TDC)
- Mutagenicity refers to the capability of a substance to induce genetic alterations.
- Goal is to predict whether it is mutagenic (1) or not mutagenic (0).
Statistics¶
Dataset | Total Compounds | Non-mutagenic (0) | Mutagenic (1) |
---|---|---|---|
Original Data | 7,278 | 3,304 | 3,974 |
Cleaned Dataset | 3,533 | 1,463 | 2,070 |
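The class balance of the two classification datasets can be checked directly from the counts in the tables above. This is a small arithmetic sketch; the dictionary keys are labels chosen here for illustration.

```python
# (total, zeros, ones) copied from the hERG and AMES statistics tables.
counts = {
    "hERG original": (306_893, 293_149, 13_744),
    "hERG cleaned": (288_787, 275_880, 12_907),
    "AMES original": (7_278, 3_304, 3_974),
    "AMES cleaned": (3_533, 1_463, 2_070),
}

positive_fraction = {}
for name, (total, zeros, ones) in counts.items():
    assert zeros + ones == total  # sanity check on each table row
    positive_fraction[name] = ones / total
    print(f"{name}: {positive_fraction[name]:.1%} labeled 1")
```

hERG is strongly imbalanced (under 5% of compounds are labeled 1 before and after cleaning), while AMES has somewhat more 1s than 0s.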
- Acute Toxicity LD50 (Source: TDC)
- Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects.
- The lower the lethal dose, the more toxic the drug.
Statistics¶
Dataset Description | Total Compounds |
---|---|
Original Data | 7300 |
Cleaned Dataset | 3400 |
- We selected the LD50 data for the current study and implemented Random Forest and Gaussian Process models with 5 AFs.
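Several acquisition functions need an uncertainty estimate alongside each prediction. How this was obtained for Random Forest is not stated here; one common approach, assumed in this sketch, is to use the spread of the individual trees' predictions. The per-tree values and the helper `ensemble_mean_std` are made up for illustration.

```python
from statistics import mean, pstdev

def ensemble_mean_std(per_tree_predictions):
    """Summarize one compound's per-tree predictions as (mean, std)."""
    return mean(per_tree_predictions), pstdev(per_tree_predictions)

# Hypothetical predictions from 5 trees for two candidate compounds.
tree_preds = {
    "compound_A": [2.1, 2.0, 2.2, 2.1, 2.1],  # trees agree: low uncertainty
    "compound_B": [1.0, 3.5, 2.2, 0.8, 3.0],  # trees disagree: high uncertainty
}

summary = {name: ensemble_mean_std(p) for name, p in tree_preds.items()}
for name, (mu, sigma) in summary.items():
    print(f"{name}: mean={mu:.2f}, std={sigma:.2f}")
```

Both compounds have the same mean prediction, so only an uncertainty-aware acquisition function would distinguish between them.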
Dataset Distribution (Needs Cleaning):¶
- For the current work, we cleaned the dataset using a property threshold and used the filtered dataset for featurization. The datasets still need a careful review, which we could not complete within the project timeline.
Acquisition Function:¶
Greedy predictive power: Predictive performance of the Greedy acquisition function across 20 iterations on the LD50 dataset, using ECFP with a length of 2048 and a radius of 2.
Normalized Acquisition Values (Green Line): The acquisition values are normalized, meaning they are scaled to a fixed range, often [0, 1], for visualization purposes.
New Observations (Red Dots): These dots mark the points selected for labeling in each iteration based on the acquisition values.
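The two plot elements described above can be sketched as min-max scaling followed by picking the highest-scoring candidates. The helper names, the raw scores, and the batch size of 2 are assumptions for illustration, not values from the study.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def select_batch(acquisition_values, batch_size):
    """Return the indices of the batch_size highest acquisition values."""
    ranked = sorted(range(len(acquisition_values)),
                    key=lambda i: acquisition_values[i], reverse=True)
    return ranked[:batch_size]

raw = [0.2, 1.7, 0.9, 3.2, 2.4]      # hypothetical acquisition scores
normalized = min_max_normalize(raw)  # values for the plotted line
chosen = select_batch(normalized, batch_size=2)  # points to label next
print(normalized)
print(chosen)
```

Normalization does not change the ranking, so the same candidates are selected whether the raw or the normalized values are used.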
Acquisition Function:¶
- Predictive power: Predictive performance of the acquisition functions at the 6th iteration on the LD50 dataset, using ECFP with a length of 2048 and a radius of 2.
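For reference, three of the acquisition functions named in this report can be written in their textbook forms as follows. This is a minimal sketch that assumes higher predicted values are better and that the model returns a mean `mu` and standard deviation `sigma` per candidate; the defaults `kappa=2.0` and `xi=0.0` are illustrative choices, not the settings used in the study.

```python
import math

def greedy(mu, sigma):
    """Greedy: rank by predicted mean only (pure exploitation)."""
    return mu

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: predicted mean plus an exploration bonus scaled by kappa."""
    return mu + kappa * sigma

def expected_improvement(mu, sigma, best_so_far, xi=0.0):
    """EI: expected amount by which a candidate exceeds the best observed value."""
    improvement = mu - best_so_far - xi
    if sigma <= 0.0:
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + sigma * pdf

# Two hypothetical candidates with equal means but different uncertainty.
confident = (2.1, 0.1)
uncertain = (2.1, 1.0)
best = 2.0
for name, (mu, sigma) in (("confident", confident), ("uncertain", uncertain)):
    print(name, greedy(mu, sigma), upper_confidence_bound(mu, sigma),
          expected_improvement(mu, sigma, best))
```

Greedy scores the two candidates identically, while UCB and EI both prefer the more uncertain one.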
Model Performance (Random Forest):¶
- Observations: The "Expected_" and "Greedy" AFs show a notable increase in the number of predictions above the 90th percentile, with the Upper Confidence Bound having a slight edge.
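One plausible reading of this metric is the number of selected compounds whose value lies above the 90th percentile of the dataset. The sketch below assumes that reading and uses a nearest-rank percentile; the helper names, the property values, and the selected indices are hypothetical.

```python
import math

def percentile_threshold(values, pct=90):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def count_top_hits(all_values, selected_indices, pct=90):
    """Count selected compounds whose value is strictly above the percentile threshold."""
    threshold = percentile_threshold(all_values, pct)
    return sum(1 for i in selected_indices if all_values[i] > threshold)

# Hypothetical property values for 20 compounds: 1.0, 2.0, ..., 20.0
values = [float(v) for v in range(1, 21)]
selected = [0, 5, 18, 19]  # indices picked by some acquisition function
hits = count_top_hits(values, selected)
print(hits)
```

Tracking this count per iteration gives one curve per acquisition function, which makes the comparison between them direct.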
Model Performance (Gaussian Process):¶
- Observations: GP appears to struggle in the high-dimensional feature space (ECFP2048).
Impact of ECFP Length on AF and Model Performance:¶
Summary:¶
We explored two ECFP fingerprint lengths (ECFP1024 and ECFP2048) and implemented 5 different acquisition functions.
Random Forest performed well, whereas the Gaussian Process (GP) appears to struggle with high-dimensional spaces.
We ran 20 iterations of each acquisition function and studied batch selection and model performance. Greedy and Upper Confidence Bound appear to explore the chemical space effectively (Random Forest).
Overall, Expected Improvement, Greedy, and Upper Confidence Bound outperform the other acquisition functions.
EP performs best with RF irrespective of fingerprint length (ECFP2048 and ECFP1024).
Future Plans:¶
- We plan to evaluate more fingerprints and machine learning models to assess the impact of AFs on overall model performance and chemical space coverage.