Want to learn data science? Check out these 8 easy steps to set out in the right direction!
One of the most popular questions in the data science field, apart from 'What is data science?', is 'How do I learn data science?'. It's not just a question that comes from those who are new to data science, but also from those who have already been around for some time. The road to the "sexiest job of the 21st Century" or the "best job of the year" for 2016 is clearly not as smooth or straightforward as one would think.
At DataCamp, our students learn data science by doing. But we have also noticed that they continue to ask these questions. You can find a lot of opinions and advice from seasoned veterans on the Internet, but this jungle of information is not making things easier for beginners. This post is meant to present a general overview of the eight steps that you need to go through to learn data science.
The goal is not to give an exhaustive list, but rather to offer a guide for everyone who is interested in learning data science, and for everyone who has already become a data scientist or part of a data science team but wants some additional resources to keep improving.
If you prefer a visual representation of this blog post, make sure to check out the corresponding infographic "Learn Data Science - 8 (Easy) Steps".
What Is Data Science?
Data science is still a fuzzy concept. There have been many definitions, or attempts at definitions, and it should come as no surprise that some of these have been represented visually. The most significant start of this trend was in 2010, when Drew Conway presented a Venn diagram to define the concept "data science". Data science sits in the center of the picture, as the result of combining hacking skills, mathematics and statistics knowledge, and substantive expertise.
Over the years, many Venn diagrams and other visual representations have circulated throughout the data science industry, some more successful than others. For a chronological overview of the most significant ones, check out the article Battle of the Data Science Venn Diagrams.
To make a long story short, in 2016 we got a slightly different image of what data science is. Matthew Mayo blogged a visual representation made by Gregory Piatetsky-Shapiro. There are a lot of things that are different. Two things that stand out are the fact that data science is no longer in the center of the picture and that the approach to defining data science is different. Data science is now defined through its relation to other disciplines, such as Artificial Intelligence (AI), Machine Learning (ML), Deep Learning, Big Data (BD) and Data Mining (DM). Data science is at the crossing of AI, ML, and BD and has an intrinsic relation with DM, as it is considered the superset of data mining and its successor term.
These two visuals might seem completely different, but they do share a lot of similarities: the disciplines that are visualized in Piatetsky-Shapiro's picture all require hacking skills, mathematics and statistics knowledge, and substantive expertise or domain knowledge.
Data Scientist's Educational Background
There have been a lot of surveys over the past few years on the educational background of data scientists and, unsurprisingly, they have produced many different results. In the O'Reilly Data Science Salary Survey of 2014, about 28% of the respondents had a Bachelor's degree, while 44% had a Master's degree and 20% had a Ph.D. Common fields that data scientists have as backgrounds are Mathematics/Statistics, Computer Science, and Engineering. The results that are represented in the infographic are from 2016. They are very similar to those of the O'Reilly survey.
In general, you could conclude that the degree that you need to have completed to become a data scientist is usually a Master's degree or Ph.D. The field that you come from is of less importance, but you have an advantage if you have a quantitative background.
Step 1. Get Good at Stats, Maths and Machine Learning
The perspective on the definition of data science might have changed over the years, but data science has remained a somewhat technical occupation. Sound knowledge of statistics, mathematics, and machine learning is still considered the main requirement for anyone who wants to do data science.
Getting up to speed with these three can be a pain, especially for those who have no technical background whatsoever. Luckily, there are more than enough quality resources to help you out: Khan Academy offers online courses on a variety of mathematics topics that will undoubtedly be of great value to you, but make sure to also take a look at the Linear Algebra course from MIT OpenCourseWare. For statistics, DataCamp's material might help you, and for machine learning, you should keep an eye out for the content on DataCamp, Stanford Online, and Coursera.
Step 2. Learn to Code
Developing your hacking skills is another thing that you still need to take into account if you want to learn data science.
You can start by getting familiar with the computer science fundamentals: get to know the basic data structures and search algorithms. Then, step up to understanding how end-to-end development works: the things you work on will be integrated with other systems, so it's best to understand how development works from beginning to end, from requirements gathering and analysis to testing and maintaining code. When you have grasped this concept, it's time to pick a language. You can go for an open source language or a commercial one. Things to take into account in your decision are the learning curve, the industry you want to work in, the salary that comes with being proficient in the language, ...
Make your choice easier with the help of this infographic. DataCamp is there to assist you if you have chosen an open source programming language.
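As a small taste of the fundamentals mentioned in this step, here is a minimal sketch of one classic search algorithm, binary search, written in Python. The function name and the sample list are made up for illustration, and the sketch assumes that the input is already sorted.

```python
from typing import Optional, Sequence


def binary_search(items: Sequence[int], target: int) -> Optional[int]:
    """Return the index of target in a sorted sequence, or None if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return None


# Hypothetical, already sorted sample data.
sorted_ids = [3, 8, 15, 23, 42, 57]
print(binary_search(sorted_ids, 23))  # index of 23 in the list
print(binary_search(sorted_ids, 4))   # 4 is not in the list
```

Because the search space is halved at every step, the lookup needs far fewer comparisons than scanning the list from start to end, which is exactly the kind of trade-off that the fundamentals teach you to recognize.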
Step 3. Understand Databases
When you start out learning data science, you see that a lot of tutorials focus on you retrieving data from flat files. However, when you start working or when you get in touch with the industry itself, you see that most of the work happens through a connection with one or multiple databases.
And there are a lot of databases out there. Companies might work with commercial ones like Oracle or they might opt for open-source alternatives. The key to seeing the forest for the trees here is to understand how databases work. Learn about the why and how of databases and the what will come. Concepts that you should grasp and know your way around in are the Relational Database Management Systems (RDBMS) and data warehousing. That means that relational versus dimensional modeling should not hold any secrets for you, nor should SQL or the Extract-Transform-Load process (ETL) surprise you.
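To make SQL and ETL a bit more tangible, the sketch below runs a tiny Extract-Transform-Load pipeline with Python's built-in sqlite3 module. Everything specific in it is an assumption for illustration: the CSV content, the `orders` table, and the function names are invented, and an in-memory SQLite database stands in for whatever database a company actually uses.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical raw export, standing in for a flat file.
RAW_CSV = """customer,country,amount
alice,BE,120.50
bob,US,80.00
alice,BE,
carol,US,45.25
"""


def extract(raw: str) -> List[Dict[str, str]]:
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, str, float]]:
    """Transform: drop rows without an amount and convert amounts to floats."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        clean.append((row["customer"], row["country"], float(row["amount"])))
    return clean


def load(conn: sqlite3.Connection, rows: List[Tuple[str, str, float]]) -> None:
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_country(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Query the table with plain SQL: total amount per country."""
    query = (
        "SELECT country, SUM(amount) FROM orders "
        "GROUP BY country ORDER BY country"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
print(revenue_by_country(conn))
```

The same three steps apply when the source is a production system and the target is a data warehouse; only the scale and the tooling change.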
Step 4. Explore The Data Science Workflow
A next phase in the learning process would be to explore the data science workflow. A lot of tutorials or courses focus on only one or two aspects of it but lose the general overview of the process that you will need to go through once you're working as a data scientist or in a data science team. It's essential not to lose sight of the iterative process that data science is.
For data science beginners who know how to program, the easiest way to discover how the data science workflow works is by practicing your coding skills: get started on your journey with R or Python. There are several packages and libraries that are designed to make your coding life easier. Check out the infographic snippet below:
For those beginners who still feel that their hacking skills are lacking, it's worth checking out the open-source alternatives that don't require you to code everything. These tools will allow you to do more than one step in the data science workflow at the same time. For example, RapidMiner allows you to import or collect your data, do some operations on it to clean it, model and evaluate it. Note that it's good to know how to work with these tools but that you should keep on working on your coding skills!
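If you do want to code the workflow yourself, the following minimal sketch walks through the steps named above (import, clean, model, evaluate) using only Python's standard library. The data set, the function names, and the choice of a simple least-squares line as the model are all assumptions made for illustration; in practice you would probably reach for dedicated packages.

```python
import csv
import io
from statistics import mean

# Hypothetical data: hours studied versus exam score, with one missing value.
RAW = """hours,score
1,52
2,54
3,
4,58
5,60
6,62
"""


def import_data(raw):
    """Import: parse the raw CSV text into rows."""
    return list(csv.DictReader(io.StringIO(raw)))


def clean(rows):
    """Clean: drop incomplete rows and convert the values to numbers."""
    return [
        (float(r["hours"]), float(r["score"]))
        for r in rows
        if r["hours"] and r["score"]
    ]


def fit_line(points):
    """Model: fit y = slope * x + intercept with ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def evaluate(model, points):
    """Evaluate: mean absolute error of the model on held-out points."""
    slope, intercept = model
    return mean(abs(y - (slope * x + intercept)) for x, y in points)


points = clean(import_data(RAW))
train, test = points[:-2], points[-2:]  # keep the last two points for testing
model = fit_line(train)
print("slope and intercept:", model)
print("mean absolute error on test data:", evaluate(model, test))
```

In a real project the evaluation would send you back to an earlier step, for example to collect more data or to try a different model, which is what makes the process iterative.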
Step 5. Level Up with Big Data
Many learners are so concerned with what they call "the fundamentals" of data science that they forget the bigger picture out there. Literally. You have had some hints about this in the previous sections, but there is a discrepancy. Just as there is a discrepancy between the flat files that you use in many tutorials and the databases that are used in the industry, there is one between the small data sets that you practice on and the velocity, variety and volume of the data that is out there. It's a reality that you cannot and should not ignore.
Big data might have been hyped, but it's definitely out there, and it's important to realize this and understand what it encompasses. Three things to learn about big data are:
- See why big data requires a different approach to data processing. The best way to do this is probably by looking at big data use cases. You can read up on some here.
- Get familiar with the Hadoop framework: it's widely used for distributed data storage and processing.
- Don't forget about Spark. Getting the hang of Spark in combination with Python or Scala is the way to go. And, even better, you kill two birds with one stone: you practice your coding skills and widen your view on data science.
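To get a feel for why distributed processing calls for a different approach, here is a minimal sketch of the map, shuffle and reduce pattern that frameworks such as Hadoop are built around. It is plain Python that runs on one machine and does not use the Hadoop or Spark APIs; the partitions and function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical partitions: imagine each list living on a different machine.
PARTITIONS = [
    ["big data needs a different approach"],
    ["spark and hadoop process big data"],
]


def map_phase(lines: List[str]) -> List[Tuple[str, int]]:
    """Map: turn every word in a partition into a (word, 1) pair."""
    return [(word, 1) for line in lines for word in line.lower().split()]


def shuffle(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffle: group the values of all partitions by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the grouped values into one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


# Each partition is mapped independently, which is what allows parallelism.
mapped = [map_phase(partition) for partition in PARTITIONS]
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"], counts["spark"])  # prints: 2 2 1
```

The point of the pattern is that the map step never needs to see the whole data set at once, so the work can be spread over many machines.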
Step 6. Grow, Connect and Learn
Grow. Once you have gotten to the point where you have mastered the fundamentals, it's time to grow: practice as much as you can by doing data science challenges, like the ones you find on Kaggle. They will definitely challenge you to put the theory into practice. You should also let your intuition grow.
Connect. As a data science learner, you might fall into the trap of staying occupied with your own learning and that of other learners, but it is equally important to connect to those who already have some more experience in the field. This way, you build up a network to fall back on in case you have questions or need advice or tips. These people will motivate you to keep up the good learning and will challenge you to go even further.
Learn. Continuous learning and data science could be synonyms. Challenges such as the Kaggle ones mentioned above, or those on DrivenData, will teach you a thing or two about how data science is done in practice. Apart from these relatively small exercises, you might consider starting up a pet project and exploring some things on an even deeper level.
Step 7. Immerse Yourself Completely
Just like a language bath, you're in need of a data science bath. Depending on the skills and knowledge that you already have, you might consider a boot camp, an internship or a job. A boot camp is an amazing way of kickstarting or boosting your data science learning. As a plus, you meet a lot of people, and you have an opportunity to build or extend your network. Are you having trouble finding one? Check out Galvanize, but don't forget that your Meetup Groups might also organize boot camps and workshops for the community!
Secondly, when you have got the basics of data science under control, you should consider getting an internship. A lot of big companies like Facebook have looked for interns before, so they are a great place to start your search. You can also use your social channels or your network to get first-hand information on open positions for internships. Lastly, take a look at startups: these smaller companies can be willing to let you learn on the job as long as you learn quickly. AngelList is worth checking out for startup jobs.
The last immersion option is where most learners experience a bottleneck, as the recent search trend in "Data Science Interviews" confirms. Even though you might be very enthusiastic about a job as a data scientist, it's essential to keep a couple of things in mind when you're looking for a job:
- The job postings don't always have the roles right. They might post for a "Data Scientist" position, but in reality, they're looking for a data engineer or business analyst. Check out DataCamp's The Data Industry: Who Does What infographic to see what companies look for when they post open positions.
- Set your expectations straight: starting in a data scientist or analytics position if you haven't had any real-life experience with the data science workflow, databases or end-to-end development, is not realistic. Make sure you have relevant experience to show when you're applying.
Don't get discouraged if you can't get the job immediately. Instead, make sure that you keep busy and build your experience, and keep an eye out for the companies that have posted data science positions before, like Google, Microsoft, and Twitter.
Step 8. Engage with The Community
This last step is one that can be overlooked sometimes. Even when you have a job in data science or as a data scientist, you still need to remember that data science equals continuous learning. There are new advancements all the time, and it's of key importance to stay informed and curious about what's happening around you. So don't hesitate to contribute to discussions on social media, subscribe to a newsletter, follow the key people of the data science industry, listen to a podcast, ... Do whatever you can to engage with the community!
To stay up to date with the latest news, you can subscribe to the following newsletters: the bimonthly KDnuggets newsletter, Data Elixir, or the Data Science Weekly newsletter. Next, follow some of the key people in the data science industry on Twitter. This will also keep you up to speed with the latest. Just some of the people that might interest you are DJ Patil, Andrew Ng, and Ben Lorica.
Join some communities online. LinkedIn, Facebook, Reddit, ... They all offer the possibility to connect with peers. You should take the opportunity to become a member of one of these groups:
- On LinkedIn, make sure to take a look at the "Big Data, Analytics, Business Intelligence", "Big Data Analytics", "Data Scientists" or "Data Mining, Statistics, Big Data, Data Visualization, and Data Science" groups.
- On Facebook, the "Beginning Data Science, Analytics, Machine Learning, Data Mining, R, Python" and "Learn Python" groups might interest you.
- Subreddits that you can keep an eye on are "/r/datascience", "/r/rstats" and "/r/python", among many others!