Abstract:
This project compares acquisition functions (AFs) and their effect on the efficiency of Bayesian Optimization (BO) in drug discovery, focusing on small, diverse, unbalanced, and noisy datasets. The study will evaluate the impact of different acquisition functions, molecular featurization methods, and applicability domain (AD) across multiple drug discovery datasets to identify optimal strategies and best practices for using AFs effectively in drug discovery.
Featurizations:
- Multiple fingerprints are available for molecular featurization, such as ECFP, Morgan, atom-pair, MACCS keys, Topological Torsion, and PubChem fingerprints, each capturing different aspects of molecular structure for computational analysis.
- For the current study, we implemented ECFP fingerprints at three lengths (512, 1024, and 2048 bits) to evaluate their impact on the predictive performance of active learning models, seeking to balance representational detail against computational cost.
ECFP (Extended-Connectivity Fingerprint):
- Represents the molecular structure as a binary string, capturing the presence of specific substructures.
- Commonly used in cheminformatics for drug discovery to compare molecular similarity.
- Radius: Defines the size of the atom's neighborhood considered when generating the fingerprint.
  - Influences the specificity of the fingerprint: a small radius might be too local and miss important context, whereas a very large radius might yield overly specific features that generalize poorly.
- Length (Bit Size): Refers to the length of the binary string in the fingerprint.
  - 512 bits: Offers a more compact representation, leading to faster computation and lower memory usage.
    - More prone to collisions, where different substructures map to the same bit, which can reduce discriminative power.
  - 1024 bits: Often used as a standard size that balances resolution and computational efficiency.
    - Sufficient for a wide range of small to medium-sized molecules.
  - 2048 bits: Provides a higher resolution than 1024 bits, capturing more detail and potentially improving similarity measures.
    - Useful for larger and more complex molecules, where a 1024-bit fingerprint may suffer collisions (different substructures hashing to the same bit).
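The trade-off between fingerprint length and bit collisions can be illustrated with a minimal sketch that uses only the standard library. It does not compute real ECFP fingerprints: the substructure identifiers (`env_0`, `env_1`, ...), the use of SHA-256 as the hash, and the helper names `fold_features` and `tanimoto` are assumptions made for illustration, not the pipeline used in this study.

```python
import hashlib


def fold_features(features, n_bits):
    """Fold substructure identifiers into a fixed-length set of on-bits.

    Returns (on_bits, n_collisions). A collision is counted whenever a
    distinct feature lands on a bit already set by another feature.
    """
    on_bits = set()
    collisions = 0
    for feat in set(features):
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        bit = int.from_bytes(digest[:8], "big") % n_bits
        if bit in on_bits:
            collisions += 1
        on_bits.add(bit)
    return on_bits, collisions


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Hypothetical substructure identifiers for two molecules that share
# half of their atom environments (illustrative only).
features_a = [f"env_{i}" for i in range(120)]
features_b = [f"env_{i}" for i in range(60, 180)]

for n_bits in (512, 1024, 2048):
    bits_a, collisions_a = fold_features(features_a, n_bits)
    bits_b, _ = fold_features(features_b, n_bits)
    print(
        f"{n_bits:>4} bits: {len(bits_a)} on-bits, "
        f"{collisions_a} collisions, "
        f"Tanimoto = {tanimoto(bits_a, bits_b):.3f}"
    )
```

Because 512 divides 1024 and 1024 divides 2048, any two features that collide at a longer length also collide at every shorter length in this sketch, so the collision count can only shrink as the length grows.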
Datasets:
We sourced the following datasets from the Therapeutics Data Commons (TDC), cleaned them through careful review, and featurized them for modeling. Each dataset is described below, with compound counts before and after cleaning.
- hERG Central (Source: TDC)
  - The human ether-à-go-go related gene (hERG) is crucial for coordinating the heartbeat.
  - Compounds are classified as hERG blockers (1) or non-blockers (0) based on whether hERG inhibition_at_10uM < -50, i.e., whether the compound has an IC50 of less than 10 µM.
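The thresholding rule above can be written as a short sketch. The function name `herg_label` and the example readouts are hypothetical; only the `< -50` cutoff comes from the dataset description.

```python
def herg_label(inhibition_at_10um: float, threshold: float = -50.0) -> int:
    """Return 1 when the inhibition readout at 10 µM is below the threshold."""
    return 1 if inhibition_at_10um < threshold else 0


# Hypothetical readouts, for illustration only.
readouts = [-82.5, -50.0, -12.3, 4.1]
labels = [herg_label(x) for x in readouts]
print(labels)  # [1, 0, 0, 0]
```

The comparison is strict, so a readout of exactly -50 is labeled 0 in this sketch.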
Dataset Statistics
Dataset | Total Compounds | Non-blockers (0) | Blockers (1) |
---|---|---|---|
Original Data | 306,893 | 293,149 | 13,744 |
Cleaned Dataset | 288,787 | 275,880 | 12,907 |
- AMES Mutagenicity (Source: TDC)
  - Mutagenicity refers to the capability of a substance to induce genetic alterations.
  - The goal is to predict whether a compound is mutagenic (1) or not mutagenic (0).
Dataset Statistics
Dataset | Total Compounds | Non-mutagenic (0) | Mutagenic (1) |
---|---|---|---|
Original Data | 7,278 | 3,304 | 3,974 |
Cleaned Dataset | 3,533 | 1,463 | 2,070 |
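Since class imbalance is one of the dataset properties this study focuses on, the sketch below computes the positive-class fraction and the negative-to-positive ratio from the counts in the two classification tables above. The helper name `class_balance` and the dictionary layout are illustrative assumptions; the counts are copied from the tables.

```python
# (number of 0s, number of 1s), taken from the dataset statistics tables.
counts = {
    "hERG Central (original)": (293149, 13744),
    "hERG Central (cleaned)": (275880, 12907),
    "AMES (original)": (3304, 3974),
    "AMES (cleaned)": (1463, 2070),
}


def class_balance(n_neg: int, n_pos: int) -> dict:
    """Summarize the class balance of a binary dataset."""
    total = n_neg + n_pos
    return {
        "total": total,
        "positive_fraction": n_pos / total,
        "neg_to_pos_ratio": n_neg / n_pos,
    }


for name, (n_neg, n_pos) in counts.items():
    stats = class_balance(n_neg, n_pos)
    print(
        f"{name}: total={stats['total']}, "
        f"positives={stats['positive_fraction']:.1%}, "
        f"neg:pos={stats['neg_to_pos_ratio']:.2f}"
    )
```

On these counts, hERG Central is strongly imbalanced (fewer than 5% positives before and after cleaning), whereas AMES is close to balanced, with slightly more positives than negatives.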
- Half-life (Source: TDC)
  - The half-life of a drug is the time it takes for the concentration of the drug in the body to be reduced by half.
Dataset Statistics
Dataset | Total Compounds |
---|---|
Original Data | 667 |
Cleaned Dataset | 489 |
- Acute Toxicity LD50
  - Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects.
  - The higher the dose of a drug, the more lethal it is; a compound with a lower LD50 is therefore more acutely toxic.
Dataset Statistics
Dataset | Total Compounds |
---|---|
Original Data | 667 |
Cleaned Dataset | 489 |