As requested, I'm publishing this guide for those choosing between the Python and R programming languages for Data Science. Whether you are new to Data Science or need to pick one for a project, this guide will help you.
Not a disclaimer: I am a manager of Data Scientists at one of the largest employers of Data Scientists (Deloitte). These are my opinions. I've also consulted with R and Python for several decades. I'm language agnostic but have been heavily involved with the Python community for 15 years or so.
There may be a third choice
Hadley Wickham https://twitter.com/hadleywickham, Chief Scientist at RStudio, replied: "Replace 'vs' with 'and'." Prompted by this, I will cover using Python and R together as a third choice. This option intrigues me, and I cover it toward the end of this article.
How we compare R and Python
This is not an exhaustive list by any means, but here are some factors worth comparing between the two languages:
- History: R and Python have distinctly different histories that sometimes crossed paths.
- Community: many complex sociological and anthropological factors observed through fieldwork.
- Performance: a careful comparison and why it is so hard to compare.
- Third Party Support: modules, code bases, visualizations, repositories, organizations, and development environments.
- Use Case: some tasks and types of work may lend themselves to one or the other.
- Can't we all just get along? Using Python with R and R with Python.
- Predicting R vs Python: a telling exercise in eating our own dogfood.
- Preference: the ultimate answer.
A brief history:
- ABC -> Python Invented (1989 Guido van Rossum) -> Python 2 (2000) -> Python 3 (2008)
- Fortran -> S (Bell Labs) -> R Invented (1991 Ross Ihaka and Robert Gentleman) -> R 1.0.0 (2000) -> R 3.0.2 (2013)
The first thing to keep in mind when comparing the users of Python vs R is that:
Only 50% of the users of Python overlap with R
That assumes all R programmers would call their use "Scientific and Numeric". We also determined that this distribution holds regardless of the programmer's level.
For a deeper dive into the Python "hype", read my article on my Python Hype Survey results.
If we only look at the scientific and numeric community, that brings us to our second question: which community? There are several sub-communities within the overall scientific and numeric community. Although there may be some overlap, as you would suspect, they behave differently in how they interact with the larger R and Python communities.
Some examples of sub-communities using Python/R:
- Deep Learning
- Machine Learning
- Advanced Analytics
- Predictive Analytics
- Exploration and Data Analysis
- Academic Scientific Research
- An almost endless list of Computational Fields of Study
While each domain seems to serve a specific community, you will find R more prevalent in places like Statistics and Exploration. Not so long ago, you could be up and running and doing some fairly meaningful exploration with R in far less time than it would take to install Python and do similar exploration.
All of that changed with the disruptive technologies called Jupyter Notebooks and Anaconda.
note: Jupyter Notebooks: adds the ability to code Python/R in the browser; Anaconda: allows easy install and package managing for Python and R
Now that you can get up and running in an environment that is friendly to reporting and analysis out of the box, a barrier has been removed between those who wish to do the task and the language they love. Python can now come packaged in a platform-independent way and deliver quick-and-dirty analysis faster than ever before.
Another distinction in the community that impacts language choice is the idea of "open source": not just open source libraries, but the impact of collaborative communities contributing to open source. Ironically, open source licensed software from TensorFlow to the GNU Scientific Library (Apache and GPL, respectively) seems to have both Python and R bindings. Despite the copyleft licensing of R, there still seems to be more support from purists for the Python community. On the flip side, there seems to be more Enterprise support for R, especially among those with a history in Statistics.
Performance comparisons never go well. There are too many metrics and situations to test. It's hard to test on any one particular piece of hardware. Some operations are optimized in one language and not the other. Surely, you will miss something, someone will complain, friends will be lost, and the whole analysis will be tossed away with gusto! Regardless of that, here we go...
Looping - Silly
Before we go there, let's think about how Python is used vs R. Do you really want to do a lot of looping over things in R? My guess is that the intent of the language may be slightly different.
As a sanity check, including the load time and just running on the command line: R was real 0m0.238s, Python real 0m0.147s. Again, not a scientific test.
A quick test shows Python is significantly faster. Usually, it just does not matter.
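The loop test behind these numbers is not reproduced here. As a rough sketch of how one might time a plain loop on the Python side, assuming a made-up workload (the function name `sum_of_squares` and the size `N` are illustrative, not the original benchmark):

```python
import timeit


def sum_of_squares(n):
    """Add up i*i for i in range(n) using an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Hypothetical workload size, chosen only for illustration.
N = 100_000

# Each measurement calls the function 10 times; take the best of 5 measurements.
best = min(timeit.repeat(lambda: sum_of_squares(N), number=10, repeat=5))
print(f"best of 5: {best / 10:.6f} s per call")
```

Unlike the command-line figures above, this measures only the loop inside an already running interpreter, so it leaves out load time. It is just as unscientific.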
What does matter to a Data Scientist regarding speed? The emerging trend found in both languages is their ability to be used as a command language. For example, most of those programming in Python rely heavily on Pandas for their work. This moves the topic to what modules and libraries exist in each language and how they perform. That is a more meaningful comparison.
Third Party Support
Python has PyPI, R has CRAN, both have Anaconda.
CRAN uses the `install.packages` command built into the R distribution. As of this writing, there are around 12K packages available on CRAN. Scrolling through the list, it appears that half or more of all packages have something to do with Data Science: roughly 6K or more.
PyPI has over 10X the number of packages: 141K. There are 3.7K packages labeled as Scientific/Engineering specific. Many others are indeed scientific and are just not labeled as such.
In both cases, neither seems to suffer from gross duplication of effort. Sure, I get 170 projects on PyPI when I search for "Random Forest"; however, the packages seem to be different from one another.
Although Python has 10X the number of packages, the number of scientific Data Science packages is about the same, if not slightly fewer, for Python.
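The back-of-the-envelope arithmetic behind these package counts can be sketched as follows. It uses only the rounded figures quoted above; the variable names are illustrative, and the 6K value is the rough estimate from scrolling the CRAN list, not a measured count.

```python
# Rounded figures quoted in the text above.
cran_total = 12_000         # packages on CRAN
cran_data_science = 6_000   # rough estimate: half or more of CRAN
pypi_total = 141_000        # packages on PyPI
pypi_scientific = 3_700     # labeled Scientific/Engineering on PyPI

# How much bigger PyPI is overall, and what share of each index is scientific.
ratio = pypi_total / cran_total
cran_share = cran_data_science / cran_total
pypi_share = pypi_scientific / pypi_total

print(f"PyPI is about {ratio:.1f}x the size of CRAN")
print(f"Data Science share of CRAN (estimated): {cran_share:.0%}")
print(f"Labeled scientific share of PyPI: {pypi_share:.1%}")
```

The labeled share on PyPI is a lower bound, since many scientific packages there carry no such label.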
Availability of third-party packages is a very big deal. Having to write something from scratch just so it will run in your language of choice is a bummer. Likewise, if you do write it yourself, I hope you contribute that work back to the Open Source Community.
The Speed of Stuff that actually matters
R DataFrames vs Pandas is probably a much more meaningful comparison, and one that really matters.
We conducted an experiment: compare the execution times on a complex exploratory effort while mirroring each part. Here are the results:
Source code: http://nbviewer.jupyter.org/gist/brianray/4ce15234e6ac2975b335c8d90a4b6882
As we see, Python+Pandas was largely quicker than the native R DataFrames. Please note this does not mean Python is a quicker runtime. Pandas is built mostly on NumPy, which is written in C.
When it comes to visualization, what I am really comparing is ggplot2 vs matplotlib. Disclaimer: matplotlib was written by one of the people I valued most in the Python community and one who taught me Python, John D. Hunter.
Matplotlib is an 800lb gorilla. Customizing it can be done and it is very extensible, although this is not easily learned. Customization in ggplot2 is not easy either, and some would say it is even more difficult.
If you like pretty plots and you don't need to customize at all, R is my pick. If you need to do a lot more, then Matplotlib and possibly even the interactive Bokeh would be helpful. Similarly, Shiny for R would add the interactivity you may be seeking.
Can't we all just get along?
One would ask: why can't you just use both at the same time?
There are times you can use the two together. Times when:
- your group or organization allows you.
- you can get both set up and maintained easily in your environment.
- your code does not need to go into another system.
- you aren't creating a confusing mess for someone else.
Some ways to use the two together are:
- Python wrappers for R like rpy2, pyRserve, Rpython, ... (the rpy2 extension enables the Jupyter example below)
- R has several: rPython, PythonInR, reticulate, Jython, SnakeCharmR, XRPython (reticulate is written up here https://blog.rstudio.com/2018/03/26/reticulate-r-interface-to-python/)
- Use Jupyter - Mix the two, example below:
Then we can pass the pandas DataFrame, which rpy2 automatically converts into an R data frame, using the "-i df" switch of the %%R cell magic:
Predicting R vs Python
Someone on Kaggle wrote a Kernel on Predicting whether a developer uses R or Python. He came up with some interesting observations based on the data:
- If you're looking to move towards Linux next year, you're more likely a Python user
- If you studied statistics, you're more likely an R user, and if computer science, then a Python user
- If you're young (18-24 years old), you're more likely a Python user
- If you do code competitions, you're more likely a Python user
- If you want an Android next year, you're more likely a Python user
- If you want to learn SQL next year, you're more likely an R user
- If you use MS office, you're more likely an R user
- If you want a Raspberry Pi next year, you're more likely a Python user
- If you're a full-time student, you're more likely to be a Python user
- If you're using Agile methodology, you're more likely to be a Python user
- If you're more worried than excited about AI, then you're more likely to be an R user
When I corresponded with Alex Martelli, Google and Stack Overflow lord, he explained to me why Google had started with a few languages they officially supported. Even in a free-spirited, innovative space like Google, there seem to be some restrictions. That is a preference that comes into play here as well: corporate preference.
Aside from corporate preference, someone in an organization is usually the first. I know who the first was at Deloitte to use R. He's still with the firm and is now the Lead Data Scientist. The point being, and my general advice in all things: follow what you love, love what you follow, lead the pack, and love what you do.
One qualifying statement: although I've never been a tool-first thinker, if you are working on something important, it may not be the best time to experiment. Mistakes are possible. However, a very well designed Data Science project leaves some headroom for the Data Scientists. Use a portion of that to learn and experiment. Keep an open mind and embrace diversity.
In closing, I'm sticking mostly with Python but am looking forward to learning more R, with and without Python.