Data scientists have become the darlings of today's competitive job market. Entry-level salaries can reach six figures, and roughly 700,000 job openings are projected by 2020. There's good reason for this spike in demand, too. The job of data scientists is to extract insights hidden inside mountains of data, insights that can then be used to achieve business goals ranging from fraud detection to face recognition. Far from being uniform, the field of data science is now as varied as the business goals it helps achieve. Acknowledging this is key to building data science teams, which must be composed of individuals with highly specialized (and complementary) skill sets in order to be successful.
The specialized, complex nature of data science work poses a significant problem for hiring. In fact, there is still genuine confusion in the job market about what the term "data scientist" actually means. At Ancestry, where we're working with an enormous 10-petabyte database made up of millions of DNA samples and family trees, we resolve this confusion by specifying three major roles within the broader data science organization.
Organizational Structure For Data Science
Different roles within a data science organization often demand very specific technical skills, but there also needs to be a common understanding of what a data science team requires to be successful. The right technical skill sets are critical, but the team's success depends even more on how it is structured.
At Ancestry, our main goal is to provide meaningful, valuable insights to consumers about who they are, how they connect to today's society and how they have been shaped by human history. We have three categories of individual contributors within the broader data science organization that make these findings possible: data scientists, data engineers and machine learning (ML) engineers. We define these roles as follows:
Data Scientists
The emergence of big data has allowed businesses to answer a wide-ranging set of fundamental questions that were previously unanswerable. Roughly speaking, the size and richness of big data have enabled the identification of structure within the data itself. For instance, one might ask, "What are the odds that a given customer opens a promotional email?" One can imagine that this probability can be estimated by way of its relationship to the customer's particular characteristics and that this relationship can be derived from copious data - one approach might be to quantify the average behavior of all customers who share similar characteristics. Identifying this relationship and the relevant characteristics is the job of the data scientist.
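The averaging approach described above can be sketched in a few lines of Python. This is only a minimal illustration, not a description of how Ancestry builds its models: the records, the two characteristics (age band and subscriber status) and the function names are all hypothetical, chosen just to show how an open probability might be estimated from the average behavior of similar customers.

```python
from collections import defaultdict

# Hypothetical historical records: (age_band, is_subscriber, opened_email).
# These values are invented for illustration only.
HISTORY = [
    ("18-34", True, True),
    ("18-34", True, False),
    ("18-34", True, True),
    ("18-34", False, False),
    ("35-54", True, True),
    ("35-54", False, False),
    ("35-54", False, False),
    ("35-54", False, True),
]


def open_rates(history):
    """Return the observed open rate for each group of customers
    who share the same characteristics."""
    opened = defaultdict(int)
    total = defaultdict(int)
    for age_band, is_subscriber, did_open in history:
        key = (age_band, is_subscriber)
        total[key] += 1
        opened[key] += int(did_open)
    return {key: opened[key] / total[key] for key in total}


def estimate_open_probability(rates, age_band, is_subscriber, fallback):
    """Look up the open rate of the matching group; use the fallback
    rate when no past customer shares these characteristics."""
    return rates.get((age_band, is_subscriber), fallback)


rates = open_rates(HISTORY)
# Overall open rate across all customers, used for unseen groups.
overall = sum(did_open for _, _, did_open in HISTORY) / len(HISTORY)

print(estimate_open_probability(rates, "18-34", True, overall))
```

A real model would rarely stop at simple group averages, since groups with few customers give unreliable estimates, but the sketch shows the basic idea of deriving a prediction from the relationship between characteristics and past behavior.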
In general terms, data scientists produce mathematical models for the purposes of prediction. In the example above, we are predicting whether a user will open a future promotional email. That's why most data scientists have graduate-level training in computer science, mathematics or statistics, as the development and interpretation of mathematical models require deep technical knowledge. Data scientists also need strong programming skills in order to make effective use of the range of available software tools. Aside from technical savvy, data scientists need sound common sense and business understanding in order to produce high-quality models. Ancestry hires data scientists on the basis of these three characteristics - technical knowledge, programming skills and common sense. Our large and growing data science team now supports all lines of our business.
Data Engineers
The job of data scientists is impossible if the requisite data is not available and daunting if the data is available but inconsistent. The problem of inconsistency is frequently faced by data scientists, who often complain that too much of their time is spent on data acquisition and cleaning. That's where the data engineer comes in. Their role is to create consistent and easily accessible data pipelines for consumption by data scientists. In other words, data engineers are responsible for the mechanics of data ingestion, processing and storage, all of which should be invisible to the data scientists.
Data engineers don't need to know anything about machine learning or statistics to be successful. They don't even need to be inside the core data science team. However, they must be dedicated to it and available on call.
Machine Learning Engineers
So far, we have described data scientists, who build mathematical models, and data engineers, who make data available to data scientists as the "raw material" from which mathematical models are derived. To complete the picture, these models must be deployed - that is, put into operation - to produce business value. This task is the purview of the machine learning engineer. This is a software engineering role, distinguished by the requirement that the ML engineer have considerable expertise in data science. This expertise is required because ML engineers bridge the gap between the data scientists and the broader software engineering organization. With ML engineers dedicated to model deployment, the data scientists are free to continually develop and refine their models. ML engineers are most effective when they are part of the core data science team.
The Right Choice
To sum up, if your business agenda involves getting better at predicting outcomes or understanding relationships, data science can help. The best way to get business results is to build a team of data scientists with the right skill sets, supported by dedicated ML engineers who help them deploy models in production and dedicated data engineers who make the data they need available to them. No one needs an algorithm to figure out that in business, results are what count.
The article was originally published here