This article was originally published on Towards Data Science on June 8th, 2020.
There’s no doubt that data science requires decent programming skills, but how much is enough? Should you know just as much as an average software engineer? This article aims to answer this question, and much more.
As a one-sentence summary: no, you don’t need to program at the level of a mid/senior backend developer. Aim to know more than the average statistician and you’ll be fine. There’s always time to learn more as you progress in your career.
The article is divided into three sections:
- How much programming is needed in data science?
- Which programming language to pick?
- Resources I recommend to get started
Please keep in mind that this article contains affiliate links to the recommended resources. That doesn’t change anything for you, as the price is identical, but I’ll get a small commission if you decide to make a purchase. Also, I only recommend materials I’ve gone through myself and whose quality I can vouch for.
Without further ado, let’s get started with the first section.
How much programming is needed in data science?
Well, a lot, but that depends on the role and the company you work for. Small companies don’t necessarily have separate, structured teams for development and data science, so you’ll need to be comfortable with both.
In a nutshell, at a small company you won’t be the best at either programming or data science. That’s not necessarily a bad thing, as you’ll get a better grasp of the product/service the company offers.
Larger companies will treat you differently, due to a more formal structure. You’ll handle data science problems only (as a data scientist), and more often than not won’t see the big picture. You’re there to do the job — not to ask too many questions.
Keep in mind that this is just a rule of thumb, drawn from my experience and that of many others.
Which programming language to pick?
It’s not an easy question, to be honest. Most websites mention Python and R as the go-to languages, but those aren’t the only options.
Some companies need a data science solution but don’t have any data scientists on board. Think of software development companies centered around web/mobile development.
While Python and R are great, I find more and more resources on solving machine learning tasks with Java, or even with Go(lang). Heck, I’ve even written a whole article on this topic.
I’m not saying languages like Java and Go are great for prototyping, but they are still a viable option for a software developer who doesn’t know Python or just doesn’t want to use it. As I dive deeper into software development, and into developing applications that use machine learning, I can see why someone would want to stay away from Python.
- Learn Python/R if you only care about data science and machine learning
- If you are a software developer and don’t want to switch languages, you can try Java and Go (among other languages)
Resources I recommend to get started
My guess is that you’ve chosen the Python route, and that’s great for several reasons:
- The language is simple to learn — more beginner-friendly than Java/Go
- It’s the most widely used language for data science
- It’s a general-purpose language — not limited to statistical tasks
If you’re an aspiring data scientist, Python will suit you just fine. There’s no need for you to explore other, more difficult languages, as coding shouldn’t be your primary concern.
But how do you get started? I’ve got two amazing books for you that helped me learn Python well, both for general programming and for data analysis tasks. Let’s start with the basics.
Learning Python, by Mark Lutz (O’Reilly)
It’s an awesome first book, no arguing there. Be aware that it’s almost a 1,500-page read, so don’t expect to finish it in one day.
Despite its length, I think it’s an essential book for learning and mastering the language. It covers every aspect of the language in an easy-to-follow manner.
Some of the major topics are data types, statements, loops, functions, function scopes and arguments, modules, classes and object-oriented programming, exceptions, generators, decorators, and many more advanced topics. As I’ve said, it’s not an overnight read, but you should be able to go through it in 2–3 months. That’s more than enough time to get the fundamentals covered and be ready to move to more advanced and practical topics.
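To give a feel for a few of the topics listed above, here’s a minimal sketch that combines a decorator, a generator, and an exception in a few lines. It is not taken from the book, and the names log_calls, running_mean, and summarize are hypothetical ones made up for this illustration.

```python
import functools


# All names below are hypothetical and exist only for this illustration.
def log_calls(func):
    """Decorator: count how many times the wrapped function is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


def running_mean(values):
    """Generator: yield the mean of the values seen so far, one at a time."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


@log_calls
def summarize(values):
    """Return the final running mean; raise ValueError on empty input."""
    means = list(running_mean(values))
    if not means:
        raise ValueError("values must not be empty")
    return means[-1]


if __name__ == "__main__":
    print(list(running_mean([2, 4, 6])))
    print(summarize([1, 2, 3]), summarize.calls)
```

If you can read this snippet comfortably and explain why the generator produces its values lazily while the decorator keeps its counter between calls, you’ve likely covered enough of the fundamentals to move on.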
That’s where the next book comes in.
Python for Data Analysis, by Wes McKinney (O’Reilly)
As you would expect, this is a logical next step for an aspiring data scientist. This time we have a much shorter book — around 500 pages. You can definitely cover it in a month if you set it as a priority.
The first 100-ish pages are a refresher on the Python programming language, so feel free to skip them.
After that point, the book covers pretty much everything you’d expect from a great data analysis book. Fundamental libraries like NumPy and pandas are covered well, both through basic examples and later through more realistic data cleaning and preparation tasks.
The book also goes through data visualization and handling time series, which is a nice bonus but not something you should buy the book for — as there are better options for those topics.
Overall, a great read, and a nice follow-up from the first book.
Before you go
Learning to program isn’t the easiest task, but it’s a must for professions like data science. How much programming you’ll do depends on the type of company you work for: expect a more developer-oriented environment in smaller companies and the opposite in large ones.
There are always exceptions, but I’ve found this to be a good rule of thumb, based on my experience and that of many others I’ve talked to. The rule doesn’t apply to small AI startups, so keep that in mind.
With regard to the language, Python is a great place to start. It’s easy to learn and gets the job done. If you are a web/mobile developer and don’t want to learn Python, Java and Go have decent options for machine learning.
For everyone else, a solid grounding in Python and data analysis is enough to handle more difficult problems with ease, so make sure you get the basics covered. The two books above should work wonders.