Which Programming Language to Use to Develop Big Data Frameworks in 2019

By Kimberly Cook | Mar 4, 2019

I have briefly discussed some of the most popular Big Data frameworks and showed that Java is the de facto programming language for Data Intensive frameworks. Java had significant advantages (e.g. Platform Independence, Productivity, JVM) over other languages during the period from 2004 to 2014, when most of the dominant Big Data frameworks were developed. In the last 10 years, the programming language landscape has changed a lot. Some classic languages have gone through major overhauls and modernization, and some very promising modern programming languages have appeared with elegant features. Computer hardware has gone through major changes as well (the rise of Multi-Core processors, GPUs, TPUs). Containerization with Docker and Kubernetes came into existence and became mainstream.

If someone or some company wants to develop the next disruptive Big Data framework in 2019 (e.g. the next Hadoop, Kafka, Spark), which programming language will be the best fit? The Big Data domain's vintage language, Java, or some other language? First I will discuss the limitations of Java, and then I will propose better alternatives in the context of Data Intensive framework development. Most of the points also apply to developing Cloud Native, IoT and Machine Learning frameworks.
Limitations of Java
Every programming language has its limitations, and Java, the most dominant programming language in the Data Intensive domain, has its fair share. Here I will discuss the main limitations of Java in the context of Data Intensive framework development.

JVM: The JVM has played a huge role in Java being widely adopted and becoming one of the most popular programming languages. But like many things in life, sometimes the biggest strength is also the biggest weakness. The main limitations of the JVM are listed below:

  • Runtime: The JVM abstracts the hardware away from the developer. As a result, Java generally cannot match the speed/performance of Close-to-the-Metal languages.
  • Garbage Collector: The JVM provides a Garbage Collector, which greatly helps developers concentrate on the business problem instead of thinking about Memory management. Most of the time, the default Garbage Collector with default settings works fine. But all hell breaks loose when the Garbage Collector needs to be tuned. Java's Garbage Collectors have particular problems with a large number of long-living objects because of their "Stop the World" pauses. Unfortunately, Data Intensive applications mean a lot of Objects. Apache Flink has developed its own Off-Heap memory management solution to tackle this issue. Apache Spark has a similar Off-Heap memory management solution via Project Tungsten. Many other Big Data frameworks (Cassandra, Solr) have faced the same issue. Using the JVM to manage Objects while also developing Off-Heap memory management to bypass the JVM's Object management indicates that the JVM does not yet handle a large number of Objects efficiently.
  • Memory footprint: Due to the large memory footprint of the JVM, Java is very bad at scaling down, i.e. when 100 or more instances need to run on a single machine. This is the reason why Linkerd has moved away from its high-performance, high-throughput Scala+Finagle+Netty stack to Rust.
  • Developer Productivity: When Java first appeared in 1995, it was a very productive language for its time thanks to its lean size and simplicity. Over time, Java has added lots of features, increasing the size and complexity of the language specification, and it can no longer be considered among the most productive languages. In fact, over the last decade Java has often been criticized for its verbosity and the amount of boilerplate code it requires.

Concurrency: Although Java was released in the pre-Multi-Core era, it offers excellent Shared Memory based Concurrency support via Threads, Locks, a well-defined Memory Model and other high-level abstractions. However, Shared Memory based Concurrency is difficult to program and prone to Data Races. Java does not offer any language-level Message Passing based Concurrency (easier to program correctly) or Asynchronous event loop based Concurrency (a better choice for I/O heavy tasks). Akka and other high-performance libraries can offer Message Passing or Asynchronous Concurrency in Java. But without built-in support from the JVM, they will not be as performant as languages or runtimes which have native support (e.g. Go, Erlang, Node.js). In today's world of Multi-Core processors, this is a huge drawback of Java.

Serialization: Java's default serialization is very slow and has security vulnerabilities. As a result, Java serialization is another thorny issue in the Data Intensive landscape. Oracle has called it a horrible mistake and plans to drop it in a future Java version.

Solution: Back to the Metal
Once declared obsolete and destined for demise during the heyday of Java, the Close-to-the-Metal languages have been gaining lots of interest in recent years, and for good reasons. The C programming language was developed by Dennis Ritchie at Bell Labs at a time (1969 to 1973) when every CPU cycle and every Byte of memory was very expensive. For this reason, C (and later C++) was designed to extract the maximum performance from the hardware at the expense of language complexity. There is a misconception that in the Big Data domain one does not need to care too much about CPU/Memory: if someone needs more performance or needs to handle more data, all that is needed is to add more Machines to the Big Data Cluster. But adding more Machines/Nodes also increases the Cloud provider bill. Also, with the rise of Machine Learning/Deep Learning, hardware architecture will change rapidly in the coming years. So programming languages that give full control over hardware will only become more important in the coming days.

Near-Metal languages used to have another drawback for Data Intensive frameworks: Platform dependency. Currently, Web Server Operating Systems are overwhelmingly dominated by Linux, with around 97% market share.

The public Cloud is dominated by Linux as well, with more than 90% market share.

The meteoric rise of Containerization with Docker and Kubernetes gives the freedom to develop on any platform (e.g. Windows) while targeting any other platform (e.g. Linux). Thus, Platform dependency is no longer a critical factor in choosing a Programming Language for Data Intensive framework development.

Don't get me wrong, Java is still a formidable language for developing Data Intensive frameworks. With Java's new Virtual Machine GraalVM and new Garbage Collector ZGC, Java will become an even more attractive language in almost any domain. But I am convinced that Close-to-the-Metal languages will become more dominant than Java/Scala for developing Data Intensive frameworks in the coming years. Here I will pick three Close-to-the-Metal languages as potential candidates for developing Data Intensive frameworks in 2019 instead of Java/Scala.
C++
Like the pioneering near-Metal language C, C++ has its roots in Bell Labs. During his time at Bell Labs, Bjarne Stroustrup initially implemented C++ as "C with Classes", with the first commercial release in 1985. C++ is a general-purpose, statically typed, compiled programming language which supports multiple programming paradigms (functional, imperative, object-oriented). Like C, it is a near-Metal language which gives full control over hardware without Memory safety or Concurrency safety. Similar to C, C++ leaves safety in the developer's hands, i.e. C++ gives developers a very powerful language, but it is the responsibility of the developers to make the program Memory safe and Data Race free.

C++ also has lots of features and functionality (Feature Hell) and is probably one of the most difficult programming languages to master. Since C++11, the language has added many features (Memory Model, Shared Memory based Concurrency, lambdas) to make it simpler, safer and more Concurrency friendly. But these changes have come at a price: the C++ language specification has become bigger and even more complex. Another issue with C++ is its long build time (I remember building a CORBA library taking 30 minutes). However, with modern C++ (e.g. C++17) and principles like Resource Acquisition Is Initialization (RAII), it is comparatively easier to write Memory safe, Data Race free programs than with older versions of C++ (e.g. C++98).

C++ still lacks language-level support for Message Passing Concurrency and Asynchronous event loop based Concurrency (coroutines, planned for C++20, should help with the latter), although there are many C++ libraries which support both (the runtime behind Node.js and its famous Asynchronous event loop is itself written in C and C++). Learning C++ is difficult. Mastering C++ is even more difficult. But a group of experienced C++ developers can build unbeatable frameworks in any domain, including the Data Intensive domain. One reported example is a 4 node ScyllaDB cluster (written in C++) which outperforms a 40 node Cassandra cluster (written in Java).

Pros:
  • One of the most used and mature programming languages, with a proven track record in many fields including Big Data and Distributed Systems.
  • Blazingly fast, near-Metal language with maximum control over Hardware (CPU, GPU, TPU), designed to extract maximum performance from the Metal.
  • Excellent Tooling and a huge ecosystem of libraries. The language is getting easier and keeps evolving (Bjarne Stroustrup on C++17).

Cons:
  • No language-level support for Message Passing or Asynchronous event based Concurrency (for I/O heavy tasks).
  • Very steep learning curve and, with its large specification, one of the most daunting languages to master. Not ideal for a newbie, fresh graduate or dynamic language developer (PHP, Ruby, ...).
  • No language-level support for Memory safety or Data Race safety (although C++17 is safer compared to older C++). A few inexperienced or careless developers can wreak havoc on the whole project.

Notable Big Data Projects:

Rust
There has always been a search for a dream programming language which gives the Performance/Control of near-Metal languages (C, C++) and the safety of Runtime languages (Haskell/Python). Rust finally looks like "The Language that was Promised", i.e. it gives Performance/Control like C/C++ with safety like Haskell/Python. Inspired by the research programming language Cyclone (a safer C), Graydon Hoare first developed Rust as a personal project, which was later sponsored by Mozilla with active contributions from David Herman, Brendan Eich (creator of JavaScript) and many others. Rust is a statically typed, compiled System Programming language which supports the Functional and Imperative programming paradigms. First announced in 2010, its first stable version was released in 2015.

With the concepts of Ownership and Borrowing, it offers RAII at the language level and enables memory safe, thread safe programming with the speed of C++ without any Garbage Collector or Virtual Machine. What really sets Rust apart from other near-Metal languages (e.g. C/C++, Go) is its compile-time safety, i.e. if safe Rust code compiles, it will run memory safe and free of Data Races, as discussed in "Fearless Concurrency in Rust". It also supports both Shared Memory Concurrency and Message Passing Concurrency (via Channels), although it still lacks Asynchronous event-loop based Concurrency (in development). Alex Crichton from Mozilla has given an excellent talk explaining Rust Concurrency.

Rust also has expressive types and numeric types like the ML languages/Haskell, and its variables are immutable by default. As a result, it offers excellent functional Concurrency and data Concurrency like the ML languages/Haskell. As Mozilla is heavily involved in both Rust and WebAssembly (the next big thing in the Browser), fast, high-performance Rust code can be compiled directly to WebAssembly to run in the Browser. Another very interesting feature is that Rust has a self-hosted Compiler, i.e. the Rust Compiler is written in Rust (in the Java world, the javac compiler is written in Java, but the HotSpot JVM itself is written mostly in C++). Rust is also a great language for the Data Intensive domain due to its memory safety, Data Race freedom, zero cost abstractions and concurrency features. The Service Mesh platform Linkerd has migrated from its Scala+Netty+Finagle stack to Rust and achieved much better performance and resource utilization. The Data Intensive runtime Weld, which is written in Rust, can give up to 30x performance gains for Data Intensive frameworks (e.g. Spark).
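To make the Ownership and Message Passing ideas above a bit more concrete, here is a minimal sketch that uses only the Rust standard library. The function name parallel_sum and its parameters are my own illustrative choices and do not come from any of the frameworks mentioned in this article. Each worker thread takes ownership of its chunk of data and sends a partial result back over a Channel.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: splits `data` into roughly `workers` chunks, sums each
/// chunk on its own thread, and collects the partial sums over a channel.
fn parallel_sum(data: Vec<u64>, workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division, but never a chunk size of zero.
    let chunk_size = ((data.len() + workers - 1) / workers).max(1);
    let (tx, rx) = mpsc::channel::<u64>();

    let mut handles = Vec::new();
    for chunk in data.chunks(chunk_size) {
        // Each worker gets its own owned copy of the chunk and its own sender.
        let chunk: Vec<u64> = chunk.to_vec();
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // `chunk` and `tx` are moved into the closure, so the parent
            // thread cannot touch them any more; the compiler enforces this.
            let partial: u64 = chunk.iter().sum();
            tx.send(partial).expect("receiver should still be alive");
        }));
    }
    // Drop the original sender so the channel closes once all workers finish.
    drop(tx);

    // Receive partial sums until every sender has been dropped.
    let total: u64 = rx.iter().sum();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    total
}

fn main() {
    let data: Vec<u64> = (1..=100).collect();
    println!("sum = {}", parallel_sum(data, 4));
}
```

Because the closure is marked move, using chunk in the parent thread after spawning would be a compile error, not a runtime bug. That is the kind of guarantee the "Fearless Concurrency" slogan refers to.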

Pros:
  • Elegant Design. Rust is the first production-level language which has successfully combined the power of C/C++, the safety of Python and the expressiveness of ML and Haskell. It has the potential to be a game-changing language like C, C++ and Java. Rust was voted the most loved programming language in the StackOverflow developer survey for three consecutive years: 2016, 2017, 2018.
  • Compile-time guarantees for Memory safe (no dangling pointer, no segmentation fault, no buffer overflow) and Data Race free programs in safe Rust (note that freedom from Data Races does not rule out deadlocks).
  • Near-Metal language with maximum control over Hardware (CPU, GPU, TPU) and Blazingly fast. Idiomatic Rust is on par with idiomatic C++ in performance, as shown by the Benchmarks Game.
  • Concurrency friendly programming to take advantage of modern Multi-Core processors. Offers both Shared Memory and Message Passing Concurrency. Asynchronous Event Loop based Concurrency (for I/O heavy tasks) is also in progress (RFC 2394). With Haskell-like expressive types and immutability by default, Rust also offers functional Concurrency and Data Concurrency.
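As a small illustration of the Shared Memory side of the list above, here is a minimal sketch using Arc and Mutex from the Rust standard library. The helper name shared_counter and its parameters are hypothetical and chosen only for this example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: `threads` workers each add 1 to a shared counter
/// `increments` times, and the final value is returned.
fn shared_counter(threads: usize, increments: usize) -> usize {
    // Arc gives shared ownership across threads; Mutex guards the mutation.
    let counter = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    // The lock is released when `guard` goes out of scope (RAII).
                    let mut guard = counter.lock().expect("mutex poisoned");
                    *guard += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    let total = *counter.lock().expect("mutex poisoned");
    total
}

fn main() {
    println!("counter = {}", shared_counter(4, 1000));
}
```

If the Mutex is left out and the threads try to mutate the value behind the Arc directly, the program does not compile. The lock is also released automatically when the guard goes out of scope, which is RAII at work.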

Cons:
  • With its steep learning curve, Rust is not the ideal language for a newbie, fresh graduate or developer coming from dynamic languages, e.g. PHP, Ruby, Python.
  • Rust lacks high adoption in the industry. As a result, Rust lacks libraries (crates) and tooling, which in turn hinders adoption.
  • Rust is not yet a finished product. Rust may introduce major breaking changes or overly complex features, which could throttle its adoption.

Notable Big Data Projects:
  • Linkerd 2
  • Weld
  • Holochain
  • DataFusion
  • CephFS

Go
Go is the second language in this list which has its roots in Bell Labs. Two of the three co-creators of the language, Rob Pike (Plan 9, UTF-8) and Ken Thompson (creator of Unix), worked at Bell Labs during the time when Unix, C and C++ originated there. In the mid-2000s, Google had a huge Scalability problem: Developer Scalability (thousands of developers cannot work on the same codebase efficiently) and Application Scalability (applications cannot be deployed in a scalable way on thousands of machines). Google also had trouble integrating fresh graduates into its existing complex, multi-million line C++ codebase, suffered from the long compile times of that codebase, and faced some other issues. Finding existing languages (C++, Java) insufficient to tackle those issues, Google tasked two of the best people in the software industry, Rob Pike and Ken Thompson, with creating a new language. Go was first announced in 2009, with the first official version released in 2012.

The Go designers took C as their basis and created a simple, productive yet powerful statically typed, compiled, garbage collected System Programming language. Another key feature of Go is that it compiles very fast and produces a single executable binary which also contains the Go Runtime and Garbage Collector (a few MB) and requires no separate VM. Go also offers CSP based Message Passing Concurrency (Communicating Sequential Processes, originating from Tony Hoare's paper), which serves a similar purpose to Erlang's Message Passing. But instead of Actors and mailboxes (used by Erlang), Go uses goroutines (lightweight green threads) and channels for Message Passing. Another difference is that Erlang uses point-to-point communication between Actors, whereas Go uses flexible, indirect communication between goroutines via channels. As a result, Go offers a very simple yet extremely scalable Concurrency Model to take advantage of modern Multi-Core processors. Rob Pike has given an excellent talk about Go's Concurrency Model.

To keep the language simple and productive, Go deliberately leaves out many features, including high-level abstractions such as Generics, and it discourages Shared Memory based Concurrency (locks are available in the standard library, but Go's motto is: "Do not communicate by sharing memory; instead, share memory by communicating"). Backed by Google, Go has been well accepted by the community/industry and has excellent tooling/libraries. Some of the best Infrastructure frameworks (Docker, Kubernetes), as well as Data Intensive frameworks, are developed using Go.

Pros:
  • The most productive and simple system programming language, hands down. It is the perfect Close-to-the-Metal language for newbies, fresh graduates or developers whose only experience is with Single Threaded, dynamic languages (PHP, Ruby, Python, JavaScript, ...).
  • With language-level support for Message Passing Concurrency using goroutines (lightweight threads), it offers high concurrency and scalability. It also has a lightweight embedded Garbage Collector to offer Memory Safety.
  • Excellent tooling and library support. Already an established and proven Programming Language in the industry.

Cons:
  • Due to the presence of the Runtime and Garbage Collector, low-level control of Hardware (e.g. Heap Memory) is not possible in Go. As a result, Go is not comparable with C, C++ or Rust in terms of speed and performance. Also, Go's Garbage Collector lacks the maturity and performance of the JVM Garbage Collectors.
  • Due to its simplistic, minimalistic nature, Go lacks many key features of a general-purpose programming language, e.g. Generics, and it supports Shared Memory Concurrency only through library-level locks.
  • Go does not offer any compile-time safety guarantees for Memory or Data Races.

Notable Big Data Projects:

Source: HOB