Jon Swain

Data Science

  • May 13, 2026 • 1 min read

    Peak Performance or Just Noise?

    It's easy to look at machine learning leaderboards and assume raw scores tell the whole story. In this post, I compare statistical methods to cut through the noise and help us spot genuine model superiority.


  • Nov 26, 2025 • 17 min read

    Working with Large Virtual Chemical Libraries: Part 3 - Thompson Sampling for Classification

    Exhaustively screening billion-compound virtual libraries would take decades, so we need smarter ways to hunt for molecules. I look at how we can adapt Thompson Sampling, a classic multi-armed bandit technique from reinforcement learning, using the Beta distribution to efficiently find active compounds without breaking our computers.

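
    As a rough illustration of the idea (a minimal sketch, not the code from the article), Thompson Sampling for a yes/no "active" outcome can keep a Beta(alpha, beta) belief per arm, draw one sample from each belief, and test whichever arm draws highest. The arms, the hit rates, and the function name below are hypothetical; in a library setting an arm might be a building block or reagent.

```python
import random


def thompson_sample(true_hit_rates, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson Sampling over a handful of hypothetical arms."""
    rng = random.Random(seed)
    n_arms = len(true_hit_rates)
    alpha = [1.0] * n_arms  # 1 + number of actives seen (uniform prior)
    beta = [1.0] * n_arms   # 1 + number of inactives seen
    picks = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one plausible hit rate per arm from its current Beta belief.
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)

        # Simulated assay: the compound is active with the arm's true rate.
        is_active = rng.random() < true_hit_rates[arm]
        if is_active:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        picks[arm] += 1

    return picks, alpha, beta


picks, alpha, beta = thompson_sample([0.02, 0.05, 0.30])
print("evaluations per arm:", picks)
```

    Because each arm is chosen with the probability that it is currently believed to be best, sampling drifts toward the arm with the highest hit rate while weaker arms still get an occasional look.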

  • Nov 9, 2025 • 7 min read

    Interpretability vs. Explainability in Cheminformatics

    Interpretability and explainability are different concepts in machine learning, yet many cheminformatics authors use the terms interchangeably.


  • Sep 12, 2025 • 12 min read

    Chemprop-RF: A Hybrid Approach to Chemical Property Prediction

    Can we combine d-MPNNs and Random Forests to outperform each of them individually?


  • May 3, 2025 • 11 min read

    Drug Repurposing Using Artificial Intelligence

    Finding new uses for existing, approved medications is a massive shortcut in drug discovery. After bad weather ruined my weekend hiking plans, I sat down to build an open-source deep learning workflow to virtually screen clinical libraries for hidden hits.


  • Jan 22, 2025 • 7 min read

    TabPFN for Chemical Datasets

    TabPFN is a new transformer-based foundation model that claims to handle tabular data in a single, lightning-fast forward pass. I decided to put it to the test on several molecular property benchmarks to see how it holds up out of the box.


  • Jan 2, 2025 • 12 min read

    Working with Large Virtual Chemical Libraries: Part 2 - Genetic Algorithms

    When a virtual library is far too large to screen one molecule at a time, genetic algorithms offer an elegant way out. In part two of this series, I explore how biologically inspired selection can navigate massive combinatorial spaces using just building block data.


  • Nov 1, 2024 • 2 min read

    Displaying Distributions with Raincloud Plots

    Every time I used violin plots in presentations, the feedback turned into a debate over whether they looked like sea creatures or medieval weapons. If you want a cleaner way to show your data, raincloud plots are an incredibly intuitive alternative that combines raw data points, box plots, and density curves beautifully.


  • May 18, 2024 • 15 min read

    Working with Large Virtual Chemical Libraries: Part 1 - Active Learning

    If a computational scoring function takes just one second per molecule, screening a billion-compound library would take nearly 32 years. In part one of this series, I look at how we can use active learning loops to train a machine learning model, allowing us to intelligently hunt down the highest-performing molecules without exhaustively testing the whole library.

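
    The 32-year figure is easy to check with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes a single worker running non-stop with a 365.25-day year.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def screening_years(n_molecules, seconds_per_molecule=1.0):
    """Wall-clock years needed to score every molecule one after another."""
    return n_molecules * seconds_per_molecule / SECONDS_PER_YEAR


years = screening_years(1_000_000_000)
print(f"{years:.1f} years")  # 31.7 years
```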

  • May 5, 2024 • 14 min read

    I Want to Become a Data Scientist, but I Have No Idea Where to Start...

    When I first started looking into retraining for data science, I felt completely lost and unguided. I wrote this post to share the exact things I wish I’d known before setting out, from picking the right Python courses to navigating bootcamps and getting that first bit of real-world experience.



I am a data scientist and cheminformatician, originally from the UK, but often found in Aotearoa (New Zealand). I'm interested in using data science and machine learning to solve problems in drug discovery. When not in front of a computer, I can usually be found in the mountains or on the water.