-
TabPFN for chemical datasets
Deep learning models have traditionally performed well on unstructured data such as text and images but poorly on structured tabular data, and on tabular chemical data they are usually outperformed by Gradient Boosted Decision Trees (GBDTs). TabPFN (Tabular Prior-data Fitted Network) is a transformer-based foundation model for tabular data, pre-trained on millions of synthetic datasets to solve supervised learning tasks, with state-of-the-art performance on tabular benchmarks. But does it work for cheminformatics?
-
Working with large virtual chemical libraries: Part 2 - Genetic algorithms
This is part 2 of a planned three-post series on working with large chemical libraries. The notebook used to create this post and all the files can be found in this GitHub repo.
-
Displaying distributions with raincloud plots
I’ve tried to visualise and compare distributions using violin plots for reports and presentations in the past, and the feedback I got was generally… not great. While searching for better methods, I came across this excellent blog post by Alex Belengeanu on raincloud plots, and I’m now a big fan.
-
Working with large virtual chemical libraries: Part 1 - Active learning
This is part 1 of a planned three-post series on working with large chemical libraries. The notebook used to create this post and all the files can be found in this GitHub repo.
-
I want to become a data scientist, but I have no idea where to start...