Arguing about which programming language is the best one is a favorite pastime among software developers. The tricky part, of course, is defining a set of criteria for "best."
With software development being redefined to work in a data science and machine learning context, this timeless question is gaining new relevance. Let's look at some options and their pros and cons, with commentary from domain experts.
Even though, in the end, the choice is at least to some extent a subjective one, some criteria come to mind. Ease of use and syntax may be subjective, but things such as community support, available libraries, speed, and type safety are not. There are a few nuances here, though.
Execution speed and type safety
In machine learning applications, the training and operational (or inference) phases for algorithms are distinct. So, one approach taken by some people is to use one language for the training phase and then another one for the operational phase.
The reasoning is to do development in whichever language is most familiar, easiest to use, or has the best environment and library support. The trained algorithm is then ported to run on the environment the organization prefers for its operations.
While this is an option, especially when using standards such as PMML, it may increase operational complexity. In addition, in many cases things are not clear-cut, as code written in one language may call libraries implemented in another, which dilutes the argument about execution speed.
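As a minimal sketch of that train-here, serve-there pattern: a toy model is "trained" in Python and its learned parameters are serialized to a portable format that a service written in another language could load for inference. The data, the closed-form linear fit, and the use of JSON in place of a standard such as PMML are all illustrative assumptions, not a real workflow.

```python
import json

# Toy "training" phase: fit y = a*x + b by closed-form ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# "Deployment" phase: export the learned parameters in a language-neutral
# format, so an inference service in Java, Go, etc. can score new inputs.
artifact = json.dumps({"slope": a, "intercept": b})
print(artifact)  # {"slope": 2.0, "intercept": 1.0}
```

The point of the exercise is the handoff: once the parameters live in a neutral artifact, the serving side is free to use whatever language operations prefers.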
Another thing to note is type safety. Type safety in programming languages is a little like schema in databases: While not having it increases flexibility, it also increases the chances of errors.
"You can run an experiment for several hours, or even days, just to find out that the code crashed because of an incorrect type conversion or a wrong number of attributes in a method call," says Burkov.
Java
Despite having what is arguably the largest footprint in enterprise deployments, Java is not getting much love these days. Some of this may have to do with the "coolness factor," as Java has been challenged by newer programming languages, but there are also some very real concerns.
What has greatly helped Java establish its footprint, namely the JVM, is also a reason why people are skeptical about using it for machine learning. Similarly, garbage collection, a hallmark Java feature that shields developers from the memory-management complexities of C++, may pose problems in production environments.
When discussing trends in software development with Paco Nathan, managing partner at Derwen and data science practitioner and thought leader, the topic did come up.
Nathan notes that the trend he sees is toward real-time applications, and this is not something he believes the JVM is well-suited for, as it is an abstraction over the hardware. Adding a layer between the code and the hardware provides cross-platform portability, but also slows down execution.
Nathan also cites Ion Stoica, the initiator of Apache Spark, which is heavily used for real-time applications. Nathan mentioned that one of the rules Stoica has recently set for his research team at Berkeley is abolishing Java.
Nathan commented that he expects this to spill over from research to industry over a five-year timeframe, as is typical for directions set in research environments. But maybe we should not be too quick to write off Java.
The ups and downs Java has been through under Oracle's stewardship may have contributed to its fall from grace. They may also have something to do with the perceived stalemate in the evolution of the JVM.
With enterprise Java being handed off to the Eclipse Foundation, however, there is a chance Java and the JVM may be revitalized. There are also initiatives, such as Gandiva, which aim to optimize Java code for specialized hardware, potentially making it a competitive option for machine learning.
In addition, that large footprint has given rise to initiatives, such as DeepLearning4J, which aim to give Java users access to the same kinds of libraries typically used from other languages.
Python
According to a recent survey by KDNuggets, Python is the undisputed leader in use for data science and machine learning. Some often-cited reasons for this preference are the wide choice of libraries and the fact that it's considered an easy language to work with.
Ashok Reddy, GM DevOps at CA Technologies, notes that Python was the language of choice in his recently completed master's in AI and Machine Learning at Georgia Tech.
Reddy goes on to add that Python is gaining popularity in universities due to its simplicity, so graduates are more likely to know Python than Java. Beyond simplicity, he also cites the abundance of libraries as a key reason for this.
Reddy notes that, from a performance perspective, C is also a popular choice for use in AI and embedded-IoT applications, but Java is not going away. Reddy also sees a pattern in using Python for development and then other languages for deployment of machine learning algorithms.
This also applies internally at CA, as Reddy notes that, in addition to having legacy code in C and Java, the cross-platform portability that Java offers is a key priority for CA.
"Many startups use Ruby or Python initially, and when they grow up they switch to Java," says Reddy.
R
In the KDNuggets survey, R's share seems to be dropping compared to last year's. R, however, has been gaining enterprise adoption over the last few years.
In some ways, R is not a typical programming language, as it is not a general-purpose one. R's roots lie in statistics; it was developed specifically to address statistical needs.
That, plus the fact that it is open source, makes for a wealth of off-the-shelf libraries for common and not-so-common statistical tasks. The flip side is that R has been plagued by issues such as memory management and security, and its syntax is not very straightforward or disciplined.
In the past few years, R has seen development environments built around it in order to fill those gaps and take it out of the data science lab and into enterprise deployments.
One of those, created by Revolution Analytics, has been integrated into Microsoft's offering (Visual Studio, SQL Server, Power BI, and Azure) following its acquisition by Microsoft. Another one, RStudio, was integrated initially with Apache Spark and now with Databricks.
The way this was done is indicative of another strength of R -- its package system. It is through this, and its ties with the academic community, that R keeps up to date with the latest developments in data science and machine learning.
While R may be a good choice for development, its value in production is highly dependent on its supporting ecosystem.
Julia, Golang, Rust, Swift, and JVM languages
And what about those who want neither the dynamic typing of Python nor the legacy baggage of Java or C/C++? Well, apart from the fact that Python 3.6 and later supports optional static typing via type hints, there are a few alternatives.
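Python's optional typing looks like this in practice: annotations are declared in the source but ignored by the interpreter at runtime, and only a separate checker enforces them ahead of execution. The names below are illustrative, not from any real codebase.

```python
# PEP 526 variable annotations and PEP 484 function annotations (Python 3.6+).
# The interpreter records but does not enforce them; a static checker
# such as mypy verifies them before a long run starts.
learning_rate: float = 0.01

def decay(rate: float, factor: float = 0.5) -> float:
    """Scale down a learning rate between epochs (halving by default)."""
    return rate * factor

print(decay(learning_rate))             # 0.005
print(decay.__annotations__["return"])  # <class 'float'>
```

Because the annotations are optional, existing untyped code keeps running unchanged, which is what makes the typing gradual rather than mandatory.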
Burkov notes that Scala and Kotlin, two newer languages based on the JVM, have optional typing, but a steep learning curve and low user adoption, respectively. And, in the end, we might add, they also come with the same restrictions imposed by the JVM.
Swift, notes Burkov, has static typing but low availability of machine learning and data analysis libraries. Other options suggested by contributors in the same thread are Golang, Julia, and Rust.
Golang has been pointed out as fast, thread-ready, easy, clean, compiled, and simple, with growing library support for NLP, general machine learning, and data analysis, extraction, processing, and visualization.
Rust has been pointed out as compiling natively and running efficiently like plain C/C++, doing without garbage collection, and being type-safe with a rich type system. Even its proponents admit, though, that it is not really ready for machine learning, due to a lack of ML-specific libraries.
The choice of programming language is not a simple one, and in the end, it may not even be the most important one. As pointed out by Luiz Eduardo Le Masson, data science leader at Stone Co.:
"For 'ordinary machine learning,' it does not matter what language you use. But when you need to have real online learning algorithms and inferences in real time for millions of simultaneous clusters and respond in less than 500 ms, the topic does not only involve languages but architecture, design, flow control, fault tolerance, resilience."