Data Science

May 13, 2026 • 1 min read

Peak Performance or Just Noise?

It's easy to look at machine learning leaderboards and assume raw scores tell the whole story. In this post, I compare statistical methods to cut through the noise and help us spot genuine model superiority.

Nov 26, 2025 • 17 min read

Working with Large Virtual Chemical Libraries: Part 3 - Thompson Sampling for Classification

Exhaustively screening billion-compound virtual libraries would take decades, so we need smarter ways to hunt for molecules. I look at how we can adapt Thompson Sampling, a classic reinforcement learning technique, using the Beta distribution to efficiently find active compounds without breaking our computers.
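
As a rough illustration of the idea in this teaser, here is a minimal sketch of Beta-Bernoulli Thompson Sampling. It is not the article's implementation: the function name `thompson_sampling`, the three "arms" (think of them as hypothetical building blocks), and their hit rates are all assumptions made up for the example. Each arm keeps a Beta posterior over its chance of yielding an active compound, and each round we test the arm whose sampled hit rate is highest.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling over a handful of arms.

    true_rates: hypothetical per-arm probabilities of an "active" outcome
                (unknown to the sampler, used only to simulate results).
    Returns the per-arm counts of successes and failures.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)

    for _ in range(n_rounds):
        # Draw one plausible hit rate per arm from its Beta posterior.
        # Beta(1, 1) is the uniform prior before any data is seen.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        # Test the arm whose sampled hit rate is highest.
        arm = max(range(len(samples)), key=samples.__getitem__)

        # Simulate the binary outcome (active / inactive) and update.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures


# Made-up hit rates for three arms; the last one is clearly the best.
successes, failures = thompson_sampling([0.02, 0.05, 0.30], n_rounds=2000)
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```

Because sampling from the posterior balances exploration and exploitation on its own, most of the 2,000 simulated tests should end up on the arm with the highest hit rate, with no separate exploration schedule to tune.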

Nov 9, 2025 • 7 min read

Interpretability vs. Explainability in Cheminformatics

Interpretability and explainability are different concepts in machine learning, yet many cheminformatics authors use the terms interchangeably.

Sep 12, 2025 • 12 min read

Chemprop-RF: A Hybrid Approach to Chemical Property Prediction

Can we combine d-MPNNs and Random Forests to outperform either model on its own?

May 3, 2025 • 11 min read

Drug Repurposing Using Artificial Intelligence

Finding new uses for existing, approved medications is a massive shortcut in drug discovery. After bad weather ruined my weekend hiking plans, I sat down to build an open-source deep learning workflow to virtually screen clinical libraries for hidden hits.

Jan 22, 2025 • 7 min read

TabPFN for Chemical Datasets

TabPFN is a new transformer-based foundation model that claims to handle tabular data in a single, lightning-fast forward pass. I decided to put it to the test on several molecular property benchmarks to see how it holds up out of the box.

Jan 2, 2025 • 12 min read

Working with Large Virtual Chemical Libraries: Part 2 - Genetic Algorithms

When a virtual library is far too large to screen one molecule at a time, genetic algorithms offer an elegant way out. In part two of this series, I explore how biologically inspired selection can navigate vast combinatorial spaces using just building block data.

Nov 1, 2024 • 2 min read

Displaying Distributions with Raincloud Plots

Every time I used violin plots in presentations, the feedback turned into a debate over whether they looked like sea creatures or medieval weapons. If you want a cleaner way to show your data, raincloud plots are an incredibly intuitive alternative that combines raw data points, box plots, and density curves beautifully.

May 18, 2024 • 15 min read

Working with Large Virtual Chemical Libraries: Part 1 - Active Learning

If a computational scoring function takes just one second per molecule, screening a billion-compound library would take nearly 32 years. In part one of this series, I look at how we can use active learning loops to train a machine learning model, allowing us to intelligently hunt down the highest-performing molecules without exhaustively testing the whole library.
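
The 32-year figure is easy to check with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name `screening_years` is made up for illustration, and it assumes strictly serial scoring (no parallelism) and a 365.25-day year.

```python
# Seconds in one year, assuming 365.25 days per year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def screening_years(n_molecules: int, seconds_per_molecule: float = 1.0) -> float:
    """Years needed to score every molecule one after another."""
    return n_molecules * seconds_per_molecule / SECONDS_PER_YEAR


years = screening_years(1_000_000_000)
print(f"{years:.1f} years")  # 31.7 years
```

Changing `seconds_per_molecule` shows how quickly the total grows for slower scoring functions, which is the motivation for testing only a small, well-chosen fraction of the library.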

May 5, 2024 • 14 min read

I Want to Become a Data Scientist, but I Have No Idea Where to Start...

When I first started looking into retraining for data science, I felt completely lost, with no one to guide me. I wrote this post to share the things I wish I’d known before setting out, from picking the right Python courses to navigating bootcamps and getting that first bit of real-world experience.
