Our research focuses on developing machine learning-driven pipelines for drug discovery, and their application to anti-aging intervention design. We specialize in small molecule design and the development of algorithms for efficient chemical space exploration.

Specifically, we leverage generative models as molecular search engines to significantly expand the scope of potential compounds. We also use binding affinity estimation methods to bootstrap compound screening, foundation models to handle low volumes of biological data, and active learning techniques to maximize the information gained from experimental screens.

Our primary application area is the design of senolytics, drugs that selectively eliminate senescent cells. Cellular senescence has been linked to aging and is amenable to high-throughput screening.

Generative models

The chemical space of drug-like molecules is estimated to contain approximately 10^60 compounds. Even the largest virtual libraries cover only a minuscule fraction of this space, and screening them using standard supervised methods is prohibitively expensive. Generative models hold the potential to expand this search space by sampling directly from the underlying data distribution. However, most existing models fail to produce molecules that can be easily or affordably synthesized in a wet lab, limiting their practical utility. We develop generative methods that ensure synthesizability out-of-the-box and study how to improve their scalability and exploration efficiency.

Binding affinity estimation

Biological experiments are expensive and time-consuming. While proxy methods such as docking and molecular dynamics simulations exist, they are either inaccurate or computationally expensive. Although recent machine learning models aim to improve this, they often fall short in practical applications, particularly due to poor out-of-distribution generalization. Our work focuses on understanding the limitations of various binding affinity estimation methods and designing efficient multi-fidelity pipelines that guide our generative models.
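
To make the multi-fidelity idea concrete, here is a minimal sketch, assuming a two-stage funnel: a cheap scorer ranks every candidate, and an expensive scorer is run only on the shortlist. The function name `multi_fidelity_screen`, the `top_k` parameter, and the convention that higher scores are better are all hypothetical choices made for illustration. The sketch does not describe our actual pipeline, which may combine more fidelity levels.

```python
from typing import Callable, List, Sequence, Tuple


def multi_fidelity_screen(
    candidates: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Rank all candidates cheaply, then rescore only the best `top_k`.

    Assumption (illustrative): both scorers return "higher is better".
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")

    # Stage 1: low-fidelity pass over every candidate (e.g. a fast proxy).
    ranked = sorted(candidates, key=cheap_score, reverse=True)
    shortlist = ranked[:top_k]

    # Stage 2: high-fidelity pass over the shortlist only, so the expensive
    # method is called at most `top_k` times.
    rescored = [(c, expensive_score(c)) for c in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The cost of the expensive method is bounded by `top_k` rather than by the library size. The trade-off is that any compound the cheap scorer misranks never reaches the second stage.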

Foundation models

While binding affinity models are a useful starting point for compound screening, their accuracy is inherently limited, and they often require a specific protein target, which is not always available — particularly in the still-emerging field of aging research. Ultimately, we aim to train biological oracles directly on experimental data. However, the limited size of biological datasets presents a challenge. To address this, we investigate the use of chemistry foundation models as a starting point for fine-tuning on downstream biological tasks in low-data scenarios and employ multi-modal contrastive learning to incorporate unannotated data into model training.
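
As an illustration of the contrastive objective, the sketch below computes a one-directional InfoNCE loss over paired embeddings from two modalities, where row i of `view_a` and row i of `view_b` describe the same sample. The names `info_nce_loss`, `view_a`, `view_b`, and the default `temperature` are assumptions made for this example. It uses plain Python lists for readability and is not the loss or architecture we actually train with.

```python
import math
from typing import Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce_loss(
    view_a: Sequence[Sequence[float]],
    view_b: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss; view_a[i] and view_b[i] form the positive pair."""
    if len(view_a) != len(view_b) or len(view_a) == 0:
        raise ValueError("views must be non-empty and of equal length")

    total = 0.0
    for i, anchor in enumerate(view_a):
        # Similarity of the anchor to every item in the other modality;
        # all non-matching items in the batch act as negatives.
        logits = [_cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true partner.
        total += log_denominator - logits[i]
    return total / len(view_a)
```

The loss is small when each embedding is closest to its own partner and grows when partners are confused with other items in the batch. Because only pairing information is needed, no experimental labels are required.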

Active learning

Although foundation models help address low-data challenges, task-specific data is still essential for reliable predictions. Given the high cost of biological data acquisition, it is crucial to optimize the experimental process by maximizing the information gained from each screened compound. We develop active learning methods to identify the most informative compounds for screening and explore approaches that integrate chemical synthesis constraints to further increase throughput.
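
One common way to pick informative compounds is to measure disagreement within an ensemble of predictors. The sketch below is a minimal example of that acquisition rule, assuming each ensemble member is a callable that returns a scalar prediction. The function `select_batch` and its parameters are hypothetical, and the sketch ignores synthesis constraints and batch diversity. It should not be read as the method we use.

```python
from statistics import pvariance
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_batch(
    pool: Sequence[T],
    ensemble: Sequence[Callable[[T], float]],
    batch_size: int,
) -> List[T]:
    """Return the `batch_size` pool items with the highest ensemble variance."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not ensemble:
        raise ValueError("ensemble must contain at least one model")

    def disagreement(compound: T) -> float:
        # Variance of predictions across ensemble members, used here as a
        # proxy for how uncertain the models are about this compound.
        return pvariance([model(compound) for model in ensemble])

    ranked = sorted(pool, key=disagreement, reverse=True)
    return list(ranked[:batch_size])
```

In a full loop, the selected batch would be screened experimentally, the new measurements added to the training set, the ensemble retrained, and the selection repeated on the remaining pool.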

Senolytic discovery

Our primary biological focus is on cellular senescence, a key process linked to aging. We use the pipelines described above to design senolytics, drugs that selectively target and eliminate senescent cells. Of the various hallmarks of aging, cellular senescence is particularly amenable to high-throughput screening because it is easy to detect with image-based methods. Working closely with our biology collaborators, we conduct both target-based and phenotypic screens to discover novel senolytics.