I recently changed industries and joined a startup where I'm responsible for building up a data science discipline. While we already had a solid data pipeline in place when I joined, we didn't have processes for reproducible analysis, scaling up models, and running experiments. The goal of this series of blog posts is to provide an overview of how to build a data science platform from scratch at a startup, with real examples on Google Cloud Platform (GCP) that readers can try out themselves.
This series is intended for data scientists and analysts who want to move beyond the model training stage and build data pipelines and data products that can have an impact on an organization. It may also be useful for people in other disciplines who want a better understanding of how to work with data scientists to run experiments and build data products. It assumes programming experience, and code examples will be primarily in R and Java.
Why Data Science?
One of the first questions to ask when hiring a data scientist for your startup is: how will data science improve our product? At Windfall Data, our product is data, so the goal of data science aligns well with the goal of the company: building the most accurate model for estimating net worth. At other organizations, such as a mobile gaming company, the answer may not be so direct, and data science may be more useful for understanding how to run the business than for improving products. Even in these early stages, though, it's usually beneficial to start collecting data about customer behavior, so that you can improve products in the future.
Some of the benefits of using data science at a startup are:
Identifying key business metrics to track and forecast
Building predictive models of customer behavior
Running experiments to test product changes
Building data products that enable new product features
Many organizations get stuck on the first two or three steps and do not utilize the full potential of data science. The goal of this series of blog posts is to show how small teams can use managed services to move beyond data pipelines that merely calculate run-the-business metrics, and transition to an organization where data science provides key input for product development.
Here are the topics I am planning to cover in this blog series. As I write new posts, I may add or reorder sections. Please leave a comment at the end of this post if there are other topics that you feel should be covered.
2. Tracking Data: Discusses the motivation for capturing data from applications and web pages, proposes different methods for collecting tracking data, introduces concerns such as privacy and fraud, and presents an example with Google PubSub.
8. Experimentation: Provides an introduction to A/B testing for products, discusses how to set up an experimentation framework for running experiments, and presents an example analysis with R and bootstrapping. Similar posts include A/B testing with staged rollouts.
9. Recommendation Systems: Introduces the basics of recommendation systems and provides an example of scaling up a recommender for a production system. Similar posts include prototyping a recommender.
10. Deep Learning: Provides a light introduction to data science problems that are best addressed with deep learning, such as flagging chat messages as offensive. Provides examples of prototyping models with the R interface to Keras, and productizing with the R interface to CloudML.
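To give a flavor of the experimentation topic above, here's a minimal, self-contained sketch of a bootstrap analysis in Java rather than R (the actual post uses R, and all of the data and numbers below are made up for illustration): it resamples a hypothetical control and treatment group with replacement to estimate a percentile confidence interval for the difference in means.

```java
import java.util.Arrays;
import java.util.Random;

public class BootstrapAB {
    // Mean of a sample.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Draw one bootstrap resample (sampling with replacement) and return its mean.
    static double resampleMean(double[] xs, Random rng) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) sum += xs[rng.nextInt(xs.length)];
        return sum / xs.length;
    }

    public static void main(String[] args) {
        // Hypothetical per-user metric (e.g., sessions per day) for each group.
        double[] control   = {1.2, 0.8, 1.1, 0.9, 1.4, 1.0, 0.7, 1.3, 1.1, 0.9};
        double[] treatment = {1.5, 1.1, 1.6, 1.2, 1.8, 1.3, 1.0, 1.7, 1.4, 1.2};

        int trials = 10_000;
        double[] diffs = new double[trials];
        Random rng = new Random(42);  // fixed seed for reproducibility
        for (int i = 0; i < trials; i++) {
            diffs[i] = resampleMean(treatment, rng) - resampleMean(control, rng);
        }
        Arrays.sort(diffs);

        // 95% percentile confidence interval for the lift.
        double lo = diffs[(int) (0.025 * trials)];
        double hi = diffs[(int) (0.975 * trials)];
        System.out.printf("observed lift: %.3f%n", mean(treatment) - mean(control));
        System.out.printf("95%% CI: [%.3f, %.3f]%n", lo, hi);
    }
}
```

If the confidence interval excludes zero, the observed lift is unlikely to be explained by sampling noise alone; in practice you would run this on real experiment data and with a proper experimentation framework, as covered in the experimentation post.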
The series is also available as a book in web and print formats.
Throughout the series, I'll present code examples built on Google Cloud Platform. I chose this cloud platform because GCP provides a number of managed services that make it possible for small teams to build data pipelines, productize predictive models, and utilize deep learning. It's also possible to sign up for a free trial with GCP and get $300 in credits. That should cover most of the topics presented in this series, but it will quickly run out if your goal is to dive into deep learning on the cloud.
For programming languages, I'll be using R for scripting and Java for production, as well as SQL for working with data in BigQuery. I'll also present other tools such as Shiny. Some experience with R and Java is recommended since I won't be covering the basics of these languages.