Which Skills Are Most Valuable in Machine Learning?
The most valuable contributors are often generalists. Earlier, there was a lot of hype around particular machine learning methods. Candidates who have learned how to use a certain deep learning package in an online course and are applying to jobs remind me of people in the 1990s, when there was similar hype around the web, who read the "Learn VBScript in 20 Days" kinds of books instead of learning the fundamentals of computer science.
The skills that have remained important are (a) understanding the fundamentals of statistics, optimization, and building quantitative models and (b) understanding how models and data analysis actually apply to products and businesses.
Knowing how to write high-quality software. The days of one team writing throwaway models and another team implementing them in production are slowly coming to an end. With programming languages like Python and R and their packages making it easy to work with data and models, it is reasonable to expect a data scientist or machine learning engineer to attain a high level of programming proficiency and understand the basics of system design.
Working with large data sets. While "big data" is a term used way too often, it is true that the cost of data storage is on a dramatic downward trend. This means that there are more and more data sets from different domains to work with and apply models to.
And yes, knowing something about at least one of the popular areas of the field that have gotten traction lately (deep learning for computer vision and perception, recommendation engines, NLP) would be a great thing once you have the fundamental understanding and technical proficiency.
1. Programming
Programming is perhaps the most fundamental part of a data scientist's skill set, because the job of a data scientist is much more applied than that of a traditional statistician. Programming is important in multiple ways, including the three below:
- Being able to program augments your ability to do statistics. If you have a lot of statistics knowledge but no way to implement it, that knowledge becomes much less useful.
- The ability to analyze large datasets. The datasets you get to work with in industry are not as small and cute as the sample iris dataset; you easily get data that reaches millions of rows or more.
- You can create tools to do better data science. This includes everything from building systems that your company can use to visualize data, to creating frameworks that automatically analyze experiments, to managing the data pipeline at your company so the necessary data is in the right place at the right time.
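To make the point about large datasets concrete, here is a minimal sketch of computing a mean and standard deviation in a single pass, so the data never has to fit in memory at once. It uses Welford's online algorithm; the function name `streaming_mean_std`, the column name `session_minutes`, and the tiny inline CSV are all hypothetical stand-ins for a real file with millions of rows.

```python
import csv
import io
import math


def streaming_mean_std(rows, column):
    """One-pass mean and sample standard deviation (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in rows:
        value = float(row[column])
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    if count < 2:
        # The sample standard deviation is undefined for fewer than two rows.
        return count, mean, float("nan")
    return count, mean, math.sqrt(m2 / (count - 1))


# Hypothetical data: a small in-memory CSV standing in for a very large file.
sample = io.StringIO(
    "user_id,session_minutes\n"
    "1,2.0\n2,4.0\n3,4.0\n4,4.0\n5,5.0\n6,5.0\n7,7.0\n8,9.0\n"
)
count, mean, std = streaming_mean_std(csv.DictReader(sample), "session_minutes")
print(count, round(mean, 3), round(std, 3))
```

Because `csv.DictReader` yields one row at a time, the same function would work unchanged on an open file handle, which is the kind of small tool that knowing how to program puts within reach.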
2. Quantitative analysis
Quantitative analysis is the heart of a data scientist's skill set. Much of data science is about understanding the behavior of a particularly complex system by analyzing the data that it produces, both naturally and via experiments. Quantitative analysis skills are important in multiple ways, including the three below:
- Experimental design and analysis: Particularly for data scientists working on consumer internet applications, the way that data is logged and the way that experiments can be run allow for a massive amount of experimentation to test various hypotheses. There are many ways that experiment analysis can go wrong (ask any statistician), so data scientists can help a lot here.
- Modeling of complex economic or growth systems: Typical models like churn models or customer lifetime value models are common here, as well as more complicated models such as supply and demand modeling, economically optimal ways to match providers and suppliers, and methods to model the growth channels of a company to better quantify which growth avenues are the most valuable. The most famous example of this is Uber's surge pricing.
- Machine learning: Even for the data scientists who don't implement machine learning models themselves, there is tremendous value that data scientists can provide in helping create prototypes to test assumptions, select and create features, and identify areas of strength and opportunity in existing machine learning systems.
The requirement of this skill is why the data science field in particular is attractive to physicists, statisticians, economists, operations researchers, and many more, who are very used to understanding complex systems through top-down approaches (making models) or bottom-up approaches (inferences from data).
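As an illustration of the experiment analysis mentioned above, here is a minimal sketch of a two-sided, pooled two-proportion z-test for comparing conversion rates between a control and a treatment group. The function name and the counts are hypothetical, and the sketch assumes a fixed sample size decided in advance; checking results repeatedly while an experiment runs is one of the ways such an analysis can go wrong.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 1,000 of 10,000 control users converted,
# versus 1,100 of 10,000 treatment users.
lift, z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"lift={lift:.4f} z={z:.2f} p={p_value:.4f}")
```

The standard library's `math.erfc` is enough to get the normal tail probability here, so the sketch needs no external packages; in practice a statistics library would usually be preferred.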
3. Product intuition
Product intuition as a skill is tied to a data scientist's ability to perform quantitative analysis on the system. Product knowledge means understanding the complex system that generates all of the data that data scientists analyze. This is incredibly important for quite a few reasons, including:
- Generating hypotheses: A data scientist who understands the product well can generate hypotheses about ways the system can behave if changed in a particular manner. Hypotheses are based on hunches about how certain aspects of the system can behave and one needs to know about the system to be able to have hunches about how it works.
- Defining metrics: The traditional analytics skill set includes defining key primary and secondary metrics that the company can use to keep track of success at particular objectives. A data scientist needs to know about the product in order to create product metrics that both (1) measure what is intended and (2) measure something that is worth moving.
- Debugging analyses: Results that are "incredible" are more often caused by bugs than by actually incredible features of the system. Good product knowledge can help with quick sanity checks and back-of-the-envelope calculations that more quickly identify things that might have gone wrong.
Product knowledge usually involves using the product that your company is creating. If that's not possible, then at least try to get to know the people who actually use the product.
4. Communication
This skill significantly increases the leverage of all of the previous skills listed. It is particularly important and can help distinguish a good data scientist from a great one. Good communication can manifest in various ways, including:
- Communicating insights: Some data scientists call this "storytelling". The important thing here is to communicate insights in a clear, concise, and valid way, so that others in the company can effectively act on those insights.
- Data visualization and presentation: Sometimes there's nothing more effective and satisfying than a good graph for making or conveying a point.
- General communication: Working as a data scientist almost always means working as part of a team, including working with engineers, designers, product managers, operations, and more. Good general communication can help facilitate trust and understanding, which is incredibly important for someone who is entrusted with being a steward of the data.
5. Teamwork
This last skill ties together the other four skills. A data scientist in particular cannot exist in isolation, and from what I've seen does best when deeply embedded in the rest of the company (or at least within the product development org).
Teamwork is important for many reasons, including:
- Being selfless: This includes offering help and mentorship to others, and putting the company's mission before your own personal career ambitions.
- Constant iteration: A data scientist thrives on feedback, and most parts of the data scientist's work will involve back-and-forth iteration and feedback with others to reach an impactful solution.
- Sharing knowledge with others: Since the data scientist profession is quite new, there is basically no one with the complete set of skills, especially if you collect together all of the possibly useful statistical techniques, frameworks, libraries, languages, and tools. Because knowledge will be spread out across the data scientists and the organizations, it is particularly useful for data scientists to be constantly sharing their knowledge, methods, and results with each other.