I want to become a data scientist, but I have no idea where to start...

A triptych of a research chemist retraining as a data scientist, generated by DALL·E 3

Time for a change

In 2020 I was working as a synthetic chemist in a laboratory in Aotearoa (New Zealand). I was getting disillusioned with the work (and academia in general) and started looking for a new career direction and new challenges. After chatting with a few friends who were working as data scientists I realised that my current interests in scientific computing, statistics, and drug discovery would transfer nicely to a job as a data scientist or cheminformatician.

It felt like an emerging field within biomedical research, with huge potential. AlphaFold 2 had just been released and was causing a lot of excitement, but the LLM hype had yet to fully start (ChatGPT was still a few years away). It seemed like a perfect time to make the jump to a new, exciting field that would grow in the future.

When I started searching for information about making the switch I somehow felt both completely lost and unguided, and overwhelmed by the resources out there. Part of the thinking behind this post was: What do I wish I knew before setting out on this journey?

I’ve been asked a couple of times about making the change, mostly by old colleagues looking to move from chemistry research to data science, but I hope this advice is useful whatever your background is. I’m always happy to answer questions but thought it would be useful to have something to point people to that contained my main ideas and could be used as a starting point for further questions.

There’s a huge amount of resources out there, and no right or wrong way to make the change. This just lists my experience and what worked for me; it will be different for everyone. One of my favourite things about data science is the low bar for entry: there’s so much open-source software and data, YouTube videos, and Kaggle competitions that you only really need a basic computer to get started. On top of this I realise I was also lucky to have friends and colleagues that I could ask for advice, and a supportive family who were willing to house me whilst I re-trained. Through 2020 and 2021 we were occasionally confined at home due to COVID lockdowns (though thankfully in NZ these were fairly limited compared to the rest of the world), and with not much else going on this was an ideal time for some re-training.

Learning Python

As I asked around my network for advice on making this change, the first suggestion was always the same: learn to code in Python. So much data science work is done using Python libraries (pandas, numpy, scikit-learn, pytorch, and rdkit for chemistry). I now spend a significant part of each day writing and reading Python code and documentation.

I did what I often do (probably a bit too much) and scrolled Reddit. I kept seeing two online Python courses recommended: Automate the Boring Stuff with Python Programming by Al Sweigart, and 100 Days of Code: The Complete Python Pro Bootcamp by Dr. Angela Yu. Both were available on Udemy, which seems to regularly have special offers, and I completed both courses for around $10 (US) each. There are definitely similar courses available for free, so it’s a personal preference what to go for. I chose the paid courses for two reasons:

I generally found them to be more organised with a clear progression, including problem sets, projects, and with good coverage of all the important topics.
With a free course it’s easy to stop, especially if something comes up that stops you doing it for a few days and you get out of the habit (I can be pretty lazy). Paying for the course (even a fairly low cost of $10) was enough to manipulate myself into finishing (I’m getting my money’s worth).

Automate the boring stuff

The first course I completed was Automate the Boring Stuff with Python Programming by Al Sweigart. It’s a fairly short course but very well taught, designed for anyone who works regularly with a computer and wants to automate repetitive tasks, so they never have to do them again. It doesn’t go too deep into theory or style and gets stuck in with practical applications quickly. Within a couple of days, I was building small programs and could start to see how I could use it to automate parts of my job. Before starting I had no idea if I would even enjoy programming in Python, and this was an ideal introduction (turns out I did enjoy it) and gave me enough understanding to look at other courses that might be useful. It seemed perfect to make my current job less boring, but not enough for the new job I wanted, so I started looking for something more in-depth.
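To give a flavour of the kind of small program I mean, here’s a minimal sketch of automating a repetitive task: adding a date prefix to every CSV file in a folder. This is my own illustration rather than something taken from the course, and the function name `add_date_prefix` and the file names are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


def add_date_prefix(folder: Path, stamp: str) -> list[str]:
    """Rename every .csv file in `folder` to '<stamp>_<original name>'.

    Hypothetical helper for illustration. Returns the new names of the
    files that were renamed, in alphabetical order of the original names.
    """
    renamed = []
    # sorted() builds the full list first, so renaming doesn't disturb the loop.
    for path in sorted(folder.glob("*.csv")):
        # Skip files that already carry the prefix so the script is safe to re-run.
        if path.name.startswith(f"{stamp}_"):
            continue
        new_path = path.with_name(f"{stamp}_{path.name}")
        path.rename(new_path)
        renamed.append(new_path.name)
    return renamed


if __name__ == "__main__":
    # Demo in a throwaway directory so no real files are touched.
    with tempfile.TemporaryDirectory() as tmp:
        demo_folder = Path(tmp)
        (demo_folder / "run1.csv").touch()
        (demo_folder / "run2.csv").touch()
        print(add_date_prefix(demo_folder, date.today().isoformat()))
```

Nothing here is clever, and that’s the point: a dozen lines using only the standard library can replace a job you’d otherwise do by hand every week.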

100 days of code

The next course I completed was 100 Days of Code: The Complete Python Pro Bootcamp by Dr. Angela Yu. This is a much more in-depth course that covers a huge range of applications for Python. Every day has a couple of hours of lectures that cover theory or usage, and a few problem sets to apply what you’ve learnt. Every 10 days there’s a small project that uses everything you’ve learnt, and at the end there’s about 20 days of projects to construct a personal Python portfolio. I found some sections a little confusing (the first time OOP is introduced), and there’s a large section borrowed from a course on web development that I felt dragged on a bit (it’s useful to understand HTML, CSS, and JS for things like web-scraping, but I didn’t need quite that much detail on building websites). There’s a significant section towards the end on data science that was particularly useful. Overall, this course probably made the biggest difference in my re-training.

Other resources

CS50 Python from Harvard University: I haven’t completed this but if the quality is comparable to CS50 and CS50 SQL (discussed below) it’ll be a very good course for learning Python. David Malan is a fantastic lecturer and each week there’s problem sets. It’s free to complete online, but you could pay for a certificate if you wanted to.
Codewars: Lots of coding problems with a range of difficulties. I used these at the start to practice my Python programming, and still occasionally use the SQL problems to brush up as I don’t use SQL regularly at my job.
Advent of Code: A daily problem during advent that gets progressively more difficult. Can be completed in any programming language.
Fluent Python: For a more advanced understanding of Python

Data science experience

I can't get a job because I don't have experience because I can't get a job

At this stage I started thinking about jobs, but everyone I spoke to gave the same response: “What you’re doing is the right idea, but we’re just looking for someone with more experience.” This seems to be a classic problem when re-training: how do you get your first experience when everyone wants someone with experience? There are a few ways to go about this, but the main two seem to be either an internship or a bootcamp. Both have pros and cons, and I decided to go down the bootcamp route. To me the main advantages were the ease of organisation (I was relocating to the UK at the time and having something waiting for me was nice), and the shorter duration (most internships seemed to be multiple months, whereas the bootcamp I did was 5 weeks).

S2DS

At the recommendation of a friend, I applied to Science to Data Science (S2DS), a 5-week, intensive, project-based bootcamp for research scientists with a PhD or MSc looking to move into data science. The application involved a short technical project to test your exploratory data analysis skills and a behavioural interview (that seemed mostly to filter out psychopaths).

For the 5 weeks you’re part of a small group that gets paired with a company, working on a data science project. There’s a couple of talks in the first week about best practices, but you’re working on your project straight away. My team was paired with Deutsche Welle (DW), a German broadcaster. They were interested in the gender breakdown of people mentioned and quoted in their articles, so the project involved Natural Language Processing (NLP) and Named Entity Recognition (NER), as well as using techniques like web-scraping for data gathering and machine learning classifiers for predicting gender from the name.

Working in a team made the experience a lot of fun. We all came from a range of backgrounds and had different skillsets, but all started the bootcamp from similar positions so were learning together. It also gave a first experience using git and github collaboratively. In addition to the team, we had a mentor from S2DS who helped with problems, and contacts with DW who helped with project direction.

When I applied, S2DS was £800 for the entire course: not cheap, but much cheaper than many bootcamps I saw advertised online. I made the calculation that getting a job and starting to get paid more quickly would pay off in the long run. I’ve heard bad things about some bootcamps found online, so always do your research before signing up. If you’ve got the connections or are willing to put the work in organising it yourself, an internship is another great way to gain experience without the up-front cost.

My experience with S2DS was positive, and that seems to invariably be the feeling among other alumni. Having this experience on my CV and a project to talk about in interviews was invaluable during my job hunt. If you do decide to sign up, feel free to put me as your reference; I think I get some Amazon vouchers or similar. One unique advantage from completing S2DS is the ongoing career advice and the community of alumni, which will be useful going forwards in my career.

CS50x and CS50 SQL

Two other online courses I completed were CS50x and CS50 SQL from Harvard.

CS50x is Harvard’s introductory computer science course. It’s an 11-week course with lectures, short videos on specific subjects, and problem sets each week. Whilst not essential information for a data scientist, I feel having a much better understanding of how a computer works has made me a better data scientist. The first few weeks working through problems using C made me appreciate how easy Python is!

CS50 SQL is a 7-week SQL course with the same structure that teaches you all the basics you’ll need for a job in data science. Most of the 7 weeks are done using SQLite, but it later moves onto PostgreSQL and MySQL.
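If you’ve never seen SQL, here’s a minimal sketch of the sort of basics I mean (creating a table, inserting rows, and filtering with a query), run against SQLite through Python’s built-in `sqlite3` module. It’s my own illustration, not an exercise from the course: the `compounds` table and the helper names `build_demo_db` and `heavier_than` are made up for this example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory SQLite database with a small illustrative table."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    conn.execute(
        "CREATE TABLE compounds (name TEXT PRIMARY KEY, formula TEXT, mol_weight REAL)"
    )
    # Molecular weights in g/mol.
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?, ?)",
        [
            ("ethanol", "C2H6O", 46.07),
            ("benzene", "C6H6", 78.11),
            ("aspirin", "C9H8O4", 180.16),
            ("caffeine", "C8H10N4O2", 194.19),
        ],
    )
    return conn


def heavier_than(conn: sqlite3.Connection, threshold: float) -> list[str]:
    """Return compound names with mol_weight above `threshold`, heaviest first."""
    rows = conn.execute(
        "SELECT name FROM compounds WHERE mol_weight > ? ORDER BY mol_weight DESC",
        (threshold,),  # parameterised query: never paste values into the SQL string
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(heavier_than(conn, 100))  # ['caffeine', 'aspirin']
    conn.close()
```

Because SQLite ships with Python, this is an easy way to practise queries without installing a database server.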

Other resources and ideas

Python for Data Analysis: Everything you need to know about the Pandas library.
Practical Statistics for Data Scientists: Many data scientists moved into the field from other areas of research and lack the statistical understanding required (me included), so brushing up on the statistical rigour needed to properly understand results is a good idea.
Data Science from Scratch: A good introduction to Data Science using Python.
Personal projects: A portfolio on your GitHub is good for showing off your ability. After watching a few lectures from the FastAI course by Jeremy Howard I decided to create a computer vision classifier for chemical compounds.
Kaggle competitions: I’ve never actually competed in any, but they seem very popular.

Cheminformatics

With a background in chemistry research, cheminformatics seemed like a natural field for me, giving me the opportunity to combine my experience in chemistry and drug discovery with my new skills in Python programming and data science. I looked for short online courses and textbooks to get experience with cheminformatics problems and gain an understanding of the basics.

I started with the Cheminformatics OLCC. This is an introductory course with 8 sections that cover the basics of cheminformatics. It starts with help setting up a Python environment, and covers topics such as representing molecules, chemical databases, QSAR modelling, and simple machine learning for chemistry.

I then worked through Deep Learning for the Life Sciences. As datasets in the life sciences get larger, deep learning becomes more powerful. This covers how deep learning is used on molecules, proteins, and nucleic acids, including code so you can follow along.

I also attended the AI4SD Machine Learning Summer School at the University of Southampton. This was a week of lectures on topics such as machine learning, github, and LaTeX, with a focus on applications in chemistry. The week ended with a hackathon where we worked as a team to solve a chemical property prediction problem.

TeachOpenCADD looks like another great resource. I’ve not yet had the chance to go through it in detail, but it has a wide range of Jupyter Notebooks on all sorts of cheminformatics topics.

I found attending meetings and conferences on AI and cheminformatics was a great way to learn about current areas of research and meet interesting people doing similar research. This list is mostly limited to my local area (Cambridgeshire, UK).

Cambridge Cheminformatics Network Meetings run every quarter with three speakers presenting their work. Free to attend in person or virtually via zoom. There’s a “networking” opportunity (a pub trip) afterwards.
UK QSAR meetings are twice a year and free to attend with high quality speakers.
The RSC AI in Chemistry Conference is a multi-day annual meeting with speakers from all over the world presenting the cutting edge of research.
Cambridge AI Club for Biomedicine - I’ve not been to this one yet, but the topics discussed at previous meetings look interesting.
Chalmers AI4Science Seminars are monthly virtual meetings where early-career researchers present their work using machine learning for scientific research.

There’s a load of useful blogs and newsletters for keeping up to date with cheminformatics:

DrugDiscovery.NET - AI in Drug Discovery - A newsletter from Andreas Bender. It contains interesting links, details of events, and job listings in cheminformatics.
Practical Cheminformatics by Pat Walters. Some useful posts for common cheminformatics projects and issues.
The RDKit Blog by Greg Landrum. RDKit is probably the most useful Python library for a cheminformatician, and this blog has some great tips.
Cheminformania
Oxford Protein Informatics Group
Is Life Worth Living?

And a few useful journals to add to your RSS feed:

Summary

When I started writing this post, I didn’t expect it to be quite so long! Looking back at the process of re-training I realise how much work it took, and how lucky I was to have family to help support me through it all. I’m also incredibly glad I did it! My current job suits me so much more than the lab work I was doing before.

I think the general process I went through worked well, but there are still a few changes I would make if I could do it all again. I’ve heard that getting your first data science job has three equally important parts:

Your skills (e.g. Python, SQL, etc.)
Your portfolio and experience (e.g. bootcamps, personal projects, internships)
Your network (Friends, colleagues, recruiters)

In the search for my first job, I over-prioritised improving my skills, spending a lot of time doing courses. Whilst this is important for succeeding at a job, getting the job requires a wider focus. If I were to do it again, I would spend more time on personal projects to put on my github, and networking with other data scientists, asking about what problems their companies have and how I might be able to help fix them. When I finally did get my first data science job, it’s no surprise it was on the recommendation of a previous colleague.