In 2018, I invested a good amount of time in learning and writing about data science methods and technologies. In the first half of 2018, I wrote a blog series on data science for startups, which I turned into a book. In the second half, I started a new job at Zynga and learned a number of new tools including PySpark. This post highlights some of the different libraries and services I've explored over the past year.
1. Bookdown

I used Bookdown to turn the content from my blog series into a self-published book. This R package enables data scientists to turn R Markdown documents into a number of formats, including PDF, epub, and web documents. Most data scientists are probably already familiar with markdown, since it's used for GitHub documentation, which makes this package a great addition to your toolbox. Many different kernels are supported, which means that this library isn't limited to R code. I found it much easier to author a book with Bookdown than with LaTeX, which I used for my dissertation.
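As a rough sketch, a minimal Bookdown project needs little more than two small configuration files alongside the R Markdown chapters. The filenames below follow Bookdown's conventions, but the book title and chapter names are placeholders:

```yaml
# _bookdown.yml — assumed minimal project configuration
book_filename: "my-data-science-book"
rmd_files: ["index.Rmd", "01-intro.Rmd", "02-tracking.Rmd"]

# _output.yml — output formats to render (web, PDF, and epub)
bookdown::gitbook: default
bookdown::pdf_book: default
bookdown::epub_book: default
```

With those files in place, rendering every configured format is a single call to `bookdown::render_book("index.Rmd")` from R.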
2. Cloud DataFlow

One of my favorite tools available on GCP is Cloud DataFlow, which enables data scientists to author data pipelines that run in an auto-scaling and fully managed environment. In the past year, I wrote about how DataFlow can be used to author data pipelines, production models, and game simulations. It's a tool that fits many of the same use cases as Spark, but I find it preferable to use when building streaming applications.
3. Python

At Zynga, we've begun standardizing our analytics tools on the Python ecosystem. Prior to starting my new role in 2018, I had limited exposure to Python and had to ramp up quickly. I wrote about my motivation to learn Python and my desire to learn new technologies such as PySpark and deep learning. I also wrote about my approach to learning a new data science language. I'm still learning new things about Python, but now have a decent grasp of the language.
4. AWS Lambda
One of the trends that I've been focused on over the past year is enabling data scientists to put models into production. A scalable way of achieving this goal is using tools such as AWS Lambda, which enables data scientists to deploy models in the cloud. With Lambda, you specify a function, such as applying a predictive model to a set of input variables, and AWS handles deploying the function and making it scalable and fault tolerant. I wrote a post showing how to deploy a classification model using AWS Lambda.
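As a minimal sketch of this pattern, the handler below applies a hypothetical pretrained logistic model to input variables passed in the request event. The coefficients, event fields, and feature names are invented placeholders, not the actual code from the post:

```python
import json
import math

# Placeholder coefficients for an assumed pretrained logistic model
COEFFICIENTS = {"intercept": -1.5, "sessions": 0.4, "purchases": 1.2}

def lambda_handler(event, context):
    """AWS Lambda entry point: score one set of input variables."""
    features = event.get("features", {})
    # Linear combination of the inputs, then a sigmoid for a probability
    z = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value
        for name, value in features.items()
        if name in COEFFICIENTS
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    return {"statusCode": 200, "body": json.dumps({"prediction": probability})}
```

Lambda invokes `lambda_handler` with the request payload as `event`, so the same function can be tested locally with a plain dictionary before deploying.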
5. Featuretools

One of the big innovations of deep learning has been the ability to automatically generate features from semi-structured data, such as text documents. There have also been advances in feature engineering for structured data sets, such as the Featuretools library, which automates much of the work that a data scientist would perform when munging data to build a predictive model. With this package, you define the associations between different tables (entities) in your data set, and the library generates a large space of features that can be applied to building models. I showed how this approach can be used to classify NHL games as regular or post-season matches.
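Deep feature synthesis itself is beyond a short snippet, but the kind of feature it generates can be sketched by hand: given a parent table of games and a child table of plays linked by a key, aggregate the child rows into per-game features. The tables and column names here are invented for illustration; Featuretools automates exactly this sort of aggregation across all tables and columns at once:

```python
import pandas as pd

# Invented example tables: games (parent) and plays (child), linked by game_id
games = pd.DataFrame({"game_id": [1, 2], "is_postseason": [0, 1]})
plays = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "shot_distance": [10.0, 25.0, 40.0, 15.0, 30.0],
})

# Hand-rolled versions of the aggregation features (COUNT, MEAN, MAX, ...)
# that deep feature synthesis would generate automatically
features = plays.groupby("game_id")["shot_distance"].agg(
    plays_count="count", shot_distance_mean="mean", shot_distance_max="max"
)

# Join the generated features back onto the parent table for modeling
model_frame = games.merge(features, on="game_id")
```

The appeal of automating this is combinatorial: with many child tables and columns, the space of count/mean/max-style features grows far faster than anyone would write by hand.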
6. Keras

2018 was the year that I finally started getting hands-on with deep learning. I initially used the R interface to Keras to dive into building deep learning models, but later transitioned to Python for working with this package. Since I've mostly been working with structured datasets, I haven't found many cases where deep learning is the best method, but I have found the ability to use custom loss functions to be quite useful. I also wrote about deploying Keras models with DataFlow using Java.
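A Keras custom loss is just a function of `(y_true, y_pred)`. As a framework-free sketch, here is the arithmetic of an asymmetric squared error that penalizes under-prediction more heavily; the weighting scheme is an invented example, and in Keras the same logic would be written with backend tensor operations rather than Python lists:

```python
def asymmetric_mse(y_true, y_pred, under_weight=3.0):
    """Mean squared error that penalizes under-prediction more heavily.

    Mirrors the shape of a Keras custom loss, loss(y_true, y_pred),
    but operates on plain Python lists instead of tensors.
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction (predicted < actual) gets a larger weight
        weight = under_weight if error > 0 else 1.0
        total += weight * error ** 2
    return total / len(y_true)
```

This is the sort of domain knowledge custom losses let you encode, e.g. when under-forecasting revenue is costlier than over-forecasting it.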
7. Flask

Prior to learning Python, Jetty was my preferred approach for standing up web services with Java. Flask is a useful tool for exposing Python functions as web calls and is useful for building microservices. I wrote a blog post on using Flask to set up a web endpoint for a deep learning classifier. At GDC 2019, we'll show off how Zynga is using Flask and Gunicorn to build microservices for internal use.
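A minimal sketch of this pattern, with a stand-in scoring function in place of an actual trained classifier; the route and parameter names are invented for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    """Stand-in for a trained model: score based on a single feature."""
    return 1.0 if features.get("value", 0.0) > 0.5 else 0.0

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Parse the JSON payload and return the model score as JSON
    features = request.get_json(force=True)
    return jsonify({"prediction": predict(features)})
```

Flask's built-in server is fine for development, but in production a WSGI server such as Gunicorn would serve the app (e.g. `gunicorn app:app`).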
8. PySpark

I've been doing more and more work in PySpark over the past year, because it can scale to huge datasets and is easy to approach once you're familiar with Python. If you're looking to get started, I wrote a brief introduction to PySpark that shows how to get up and running with some common tasks. In the introduction, I showed how to use the Databricks community edition to get up and running with a Spark environment, but I also blogged about standing up a Spark cluster using Dataproc on GCP.
9. Pandas UDFs
Not all Python code can be directly applied in PySpark, but Pandas UDFs make it much easier to reuse Python code in Spark. With a Pandas User-Defined Function (UDF), you write functions using Pandas data frames and specify a partition key for splitting up the data frame. The result is that a large Spark data frame is partitioned among the nodes in the cluster, translated to Pandas data frames that your function operates on, and then the results are combined back into a large Spark data frame. This means that you can use existing Python code in a distributed mode. I provided an example of Pandas UDFs in my PySpark introduction post. I'll be presenting how Zynga is using Pandas UDFs to build predictive models at Spark Summit 2019.
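The PySpark API itself needs a running Spark session, but the per-partition contract can be sketched with plain Pandas: the function below has the same shape as a grouped-map Pandas UDF (a Pandas data frame for one partition key in, a Pandas data frame out), and in Spark it would be decorated with `pandas_udf` and applied per key. The data and column names are invented:

```python
import pandas as pd

def user_features(pdf):
    """Grouped-map style function: receives the Pandas data frame for one
    partition key (here, one user) and returns one row of features."""
    return pd.DataFrame({
        "user_id": [pdf["user_id"].iloc[0]],
        "total_spend": [pdf["spend"].sum()],
        "event_count": [len(pdf)],
    })

# Invented event-level data; in Spark this would be a large distributed frame
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "spend": [5.0, 10.0, 2.0],
})

# groupby(...).apply(...) mimics Spark partitioning the frame by key,
# handing each partition to the function, and recombining the results
features = events.groupby("user_id", group_keys=False).apply(user_features)
```

The useful property is that `user_features` is ordinary single-machine Pandas code, which is what lets existing Python logic run distributed with minimal changes.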
10. Open Datasets
In order to write about data science over the past year, I had to provide examples with open data sets. Some of the data sets that I've used include examples from Kaggle, BigQuery, and government data sets.
2018 was a great year to learn new data science technologies, and I'm excited to continue learning in the new year. In 2019, I'm looking forward to exploring reinforcement learning, Spark streaming, and deep learning with semi-structured data sets.