In the ever-changing ecosystem of data science tools, you often find yourself needing to learn a new language to keep up with the newest methods or to collaborate more effectively with co-workers. I've been an R coder for a few years but wanted to transition to Python to take full advantage of deep learning libraries and tools such as PySpark. I also recently joined the data science team at Zynga, where Python is the preferred language. It's only been a few weeks, but I'm starting to get the hang of performing exploratory data analysis and predictive modeling in the new language. This isn't the first time I've tried to ramp up quickly on a new data science language, but it has been the most successful. I wanted to share some of my guidelines for getting up and running with a new programming language as a data scientist.
Focus on outcomes, not semantics
While it's useful to know the fundamentals of a programming language before digging into coding, I find that only a brief introduction is needed before getting hands-on with the language. Prior to coding in Python, I read through the second chapter of Data Science from Scratch, which provides a crash course in Python. My next step was to put together a list of tasks that I wanted to be able to accomplish in Python, which included:
- Reading and writing data in CSV files
- Performing basic operations on data frames, such as showing data types
- Plotting histograms and line charts
- Connecting to a database and pulling data into a data frame
- Creating a logistic regression model
- Evaluating metrics for a model, such as accuracy and lift
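Several of these tasks are one-liners once you know where to look. Here's a minimal sketch of the first two with Pandas; the column names and values are made up for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io
import pandas as pd

# A stand-in for a CSV file on disk (the columns here are made up)
csv_text = io.StringIO(
    "user_id,games_played,spend\n"
    "1,10,2.5\n"
    "2,3,0.0\n"
    "3,25,9.99\n"
)

# Reading data from a CSV source into a data frame
df = pd.read_csv(csv_text)

# Basic operations on the data frame, such as showing data types
print(df.dtypes)
print(df.shape)

# Writing the data frame back out as CSV text
print(df.to_csv(index=False))
```

With a real file, you'd pass a path to `read_csv` and `to_csv` instead of a buffer.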
Instead of focusing on the semantics of the language, such as understanding the difference between lists and tuples in Python, I started getting hands-on with performing everyday data science tasks. I knew how to do these tasks in R and needed to learn how to translate those skills into a new language. For example, I learned that summary() in R is similar to describe() for Pandas data frames.
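The summary() to describe() translation looks like this in practice (the data frame below is a toy example):

```python
import pandas as pd

df = pd.DataFrame({
    "sessions": [4, 7, 2, 9],
    "revenue": [0.0, 1.99, 0.0, 4.99],
})

# R's summary(df) roughly corresponds to describe() on a Pandas data frame
stats = df.describe()
print(stats)

# describe() returns count, mean, std, min, quartiles, and max per column
print(stats.loc["mean", "sessions"])  # 5.5
```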
I quickly discovered that performing data science tasks in Python usually involves taking advantage of a number of different libraries. The tasks above are much easier to perform when using the Pandas and scikit-learn libraries. This leads to the next lesson I learned when ramping up on Python.
Learn the ecosystem, not the language
Tools like Tidyverse make R much more than the base language, and other data science languages have similar libraries that provide useful extensions to the language. Instead of focusing on learning just the base language, I made learning libraries part of my initial Python learning process. I explored the following libraries as a starting point.
- Pandas: Provides data frame functionality similar to R.
- Framequery: Enables using SQL on data frames.
- scikit-learn: Provides machine learning models with standard interfaces.
- Pandas-gbq: An interface to BigQuery for Python.
Using these libraries made it easier to accomplish tasks I was already familiar with in other languages. Libraries like framequery are useful when learning a new language because you can use SQL to work with data frames before becoming familiar with the Pandas way of working with data frames. I found it easy to use because it's similar to the sqldf library I've already used in R.
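To make the comparison concrete, here's a small SQL-style query expressed the Pandas way; the SQL it mirrors is shown as a comment, and the data frame is a made-up example:

```python
import pandas as pd

games = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "spend": [1.0, 2.0, 0.0, 5.0, 1.0, 1.0],
})

# SQL: SELECT user_id, SUM(spend) AS total_spend
#      FROM games GROUP BY user_id HAVING SUM(spend) > 1.0
totals = (
    games.groupby("user_id", as_index=False)["spend"]
    .sum()
    .rename(columns={"spend": "total_spend"})
)
result = totals[totals["total_spend"] > 1.0]
print(result)
```

With framequery, you could write the SQL string directly and skip the translation, which is handy while the groupby/rename idioms are still unfamiliar.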
Use cross-language libraries
It's great to be able to bootstrap your learning process by re-applying libraries you already know in a new language. One of the libraries I explored early on in my learning process was the Keras deep learning library, which I had previously used in R. While the syntax differs between languages, the concepts you apply when using the library are the same in both R and Python. Here's an example of setting up a Keras model in each language:
```r
# Creating a Keras model in R
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu",
              input_shape = 10) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1)
```

```python
# Creating the same Keras model in Python
from keras import layers, models

model = models.Sequential()
model.add(layers.Dense(64, activation="relu", input_shape=(10,)))
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(1))
```
Being able to build on libraries that I was already familiar with helped speed up my learning process. Plotly is another library that I had prior experience with in R and am now using in Python.
Work with real-world data
Sample data sets, such as the diabetes dataset in sklearn, are great for getting up and running with a new language or a new library, but you won't really learn the details of what you're doing until you need to apply these methods to a new problem. For example, you may need to perform multi-class classification rather than re-applying binary classification from an example data set. It's helpful to apply data sets from your organization early on when learning a new language.
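As an illustration of the binary-to-multi-class step, here's what the mechanics look like in scikit-learn; iris stands in here for an organizational data set, since it has three classes rather than two:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris has three classes, so this is multi-class classification
# rather than the binary case from most introductory examples
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# scikit-learn handles the multi-class case with the same interface
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same `fit`/`predict`/`score` interface applies whether the target has two classes or many, which is part of what makes scikit-learn's standard interfaces worth learning early.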
Working with real-world data is useful for many reasons, such as:
- Scale: Training data sets are usually small, and using real-world data often involves much larger data sets, requiring methods such as sampling.
- Representation: Using data from your organization means that you'll need to define how to label instances and encode features for modeling.
- Munging: You'll need to get hands-on with pulling the data into memory and performing data munging tasks such as dealing with missing values.
- Evaluation: Using an existing data set from your organization means that you can compare results in the new language with prior results in other implementations, such as comparing R's glm with sklearn.linear_model.
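One concrete gotcha when comparing glm and scikit-learn results: R's glm fits an unregularized model, while scikit-learn's LogisticRegression applies L2 regularization by default. A sketch of a like-for-like comparison, using synthetic data in place of a real data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real data set
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=500) > 0).astype(int)

# scikit-learn regularizes by default (C=1.0); R's glm does not,
# so use a very weak penalty (large C) when comparing coefficients
default_fit = LogisticRegression(max_iter=1000).fit(X, y)
glm_like = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)

print(default_fit.coef_)
print(glm_like.coef_)  # closer to what glm() would report
```

Without accounting for this, the two implementations can produce noticeably different coefficients on the same data even though both are "logistic regression."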
This guideline is similar to the first one: it helps to get hands-on with real-world data and focus on producing results.
Start locally if possible
When learning languages for distributed computing, such as Scala for Spark, it's often overwhelming to get an environment up and running. My recommendation is not to focus on distributed systems or virtualization as a starting point if it's not necessary. For example, with Spark you can set up a single-machine instance locally and use tools like Zeppelin to provide an interactive front end. When you're learning a new language, you shouldn't have to worry about deployment concerns, such as pushing code to a server. Running code locally often means that more debugging tools are available.
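Assuming PySpark is installed, a local-mode session is just a configuration sketch; the application name below is arbitrary:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark on the current machine using all available cores,
# so there's no cluster to provision and nothing to deploy
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("learning-spark")
    .getOrCreate()
)
```

From there you can experiment with Spark data frames interactively before ever touching a real cluster.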
I have the same recommendation for getting started with Python. Jupyter notebooks make it easy to get up and running with Python but are more useful once you've gotten familiar with the language. To get started, I recommend using an IDE such as PyCharm. Once you're ready to start collaborating with teammates, tools such as JupyterLab provide great environments.
Get hands-on early
When learning a new language, you won't know what you don't know until you try it. This is related to my first guideline, and it helps to get hands-on early to find out where you have gaps in your knowledge. Instead of frontloading a bunch of reading when learning Python, I found it more effective to have a mix of reading and programming. Once I got into the code, I identified new areas where I needed to learn, such as finding out that Pandas was one of the first libraries that I needed to use.
There are a few different approaches that you can use to get hands-on when learning a new language or library. When I was reading Deep Learning with R, I used the provided notebooks to start working on sample problems. Also, some companies provide Jupyter notebooks that new employees can use during their onboarding process. The approach I've usually taken is using an existing data set while reimplementing a model in a new language.
Push your knowledge
Once you've learned some of the basics of the language and learned how to perform common data science tasks, it's useful to try out some of the features of the language that are new to you. For example, when I was first learning R, I wrote basic examples of passing functions as parameters and using apply functions on data frames. I was relatively new to functional programming, and these examples helped me learn about new programming idioms. Apply is a really useful feature that I'm now able to take advantage of with Pandas.
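Both idioms carry over to Pandas. A small sketch with a made-up data frame, showing a named function and a lambda each passed as a parameter to apply():

```python
import pandas as pd

df = pd.DataFrame({"spend": [1.0, 2.5, 0.0], "sessions": [3, 5, 1]})

# Passing a named function as a parameter
def per_session(row):
    return row["spend"] / row["sessions"]

# apply() maps a function across rows (axis=1) or columns (axis=0)
df["spend_per_session"] = df.apply(per_session, axis=1)

# The same idea with a lambda passed as the parameter
df["sessions_sq"] = df["sessions"].apply(lambda s: s * s)

print(df)
```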
This is something that I would save for a little later in the learning process because it's not necessary for getting up and running with a new language but is a step that's required for mastery of a language. If you're already familiar with several programming languages, this is a step you can take on earlier.
Ask for feedback
It's great to have channels for feedback, whether it's another data scientist at your organization or online forums. You may learn about a new library or language features that you haven't yet discovered. This is more challenging when you're at a small company or working on an independent project. During my last role, when I was working as the only data scientist at a startup, I found that the rstats subreddit was a useful place to ask for feedback. There are also programming communities for different languages that can be great places to learn. I attended the useR! conference in 2016, and it was great for getting connected with scientists and practitioners using R. Python has similar conferences, but I haven't attended any yet.
Learning a new language for data science takes time, but it helps to have a plan for how you'll spend your time ramping up. My overall recommendation is to dive right in and get coding on applied problems with real data. You'll find out where you have gaps and can use that to guide your learning process.