Peak Performance or Just Noise?

Sorry it’s been quiet on here for the past 6 months. I’ve been lucky enough to start working at OpenADMET, part of the Open Molecular Software Foundation (OMSF). Part of my work has been on blind challenges and I’ve written a blog post for OpenADMET on leaderboard analysis.

You can find the blog post here.

When we look at machine learning leaderboards, it’s easy to treat raw scores as definitive rankings. However, in small-molecule drug discovery and blind challenges like the OpenADMET-ExpansionRx challenge, the performance margins between top entries are often smaller than the underlying noise of the data.

In the post, I dig into the statistical pitfalls of standard ranking approaches, including how easily we can accidentally “p-hack” our benchmarks with standard bootstrap methods. I then evaluate more robust alternatives, specifically paired bootstrap confidence intervals and permutation testing, to show how we can more reliably separate genuine model superiority from luck.
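To give a flavour of the two techniques, here is a minimal sketch in plain Python. It is not the code from the OpenADMET post: the function names, the choice of mean absolute error as the metric, and the synthetic error values are all assumptions made for illustration. The key idea in both functions is pairing: both models are scored on the same molecules, so we resample (or sign-flip) the per-molecule differences rather than treating the two models as independent samples.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def paired_bootstrap_ci(err_a, err_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(err_a) - mean(err_b) using paired resampling.

    err_a and err_b are hypothetical per-molecule absolute errors for two
    models, aligned so that index i refers to the same molecule in both.
    """
    rng = random.Random(seed)
    n = len(err_a)
    diffs = []
    for _ in range(n_boot):
        # Draw one set of molecule indices and apply it to BOTH models.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean([err_a[i] - err_b[i] for i in idx]))
    diffs.sort()
    lo_pos = max(0, int((alpha / 2) * n_boot))
    hi_pos = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return diffs[lo_pos], diffs[hi_pos]


def paired_permutation_pvalue(err_a, err_b, n_perm=2000, seed=0):
    """Two-sided sign-flip permutation test on the paired differences.

    Under the null hypothesis the model labels are exchangeable within each
    molecule, so each per-molecule difference is equally likely to have
    either sign.
    """
    rng = random.Random(seed)
    d = [a - b for a, b in zip(err_a, err_b)]
    observed = abs(mean(d))
    hits = 0
    for _ in range(n_perm):
        flipped = [x if rng.random() < 0.5 else -x for x in d]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic errors for two made-up models; not real challenge data.
    data_rng = random.Random(42)
    err_a = [abs(data_rng.gauss(0, 0.50)) for _ in range(200)]
    err_b = [abs(data_rng.gauss(0, 0.52)) for _ in range(200)]
    print("95% CI for MAE difference:", paired_bootstrap_ci(err_a, err_b))
    print("permutation p-value:", paired_permutation_pvalue(err_a, err_b))
```

If the confidence interval for the difference comfortably spans zero, or the permutation p-value is large, the gap between two leaderboard entries is hard to distinguish from noise on that test set.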

The post ended up being quite long, so I’d recommend making a hot drink before you settle down to read the whole thing!

Statistical Comparison of Blind Challenge Entries
