What Is Apache Spark and Why Is It Popular among Data Scientists?

By Jyoti Nigania | Jul 24, 2018

Why is Apache Spark popular among Data Scientists?

Answered by Sandeep Dayananda on Quora:
One of the well-known arguments is that Spark is ideal for real-time processing, whereas Hadoop is preferred for batch processing.
What is Apache Spark?
Apache Spark is an open-source cluster computing framework for real-time processing. It has a thriving open-source community and is the most active Apache project at the moment. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations.

Features of Apache Spark:
Spark has the following features:
  • Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.
  • Speed: Spark can run up to 100 times faster than Hadoop MapReduce for large-scale data processing, particularly when the working data fits in memory. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
  • Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, apart from the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
  • Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Spark adds transformations to a DAG (Directed Acyclic Graph) of computation, and this DAG is executed only when the driver requests some data.
  • Real Time Computation: Spark's computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability: the Spark team has documented users running production clusters with thousands of nodes, and the system supports several computational models.
  • Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can also run on top of an existing Hadoop cluster using YARN for resource scheduling.
  • Machine Learning: Spark's MLlib is the machine learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.
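The partitioning idea from the Speed feature can be illustrated with a small sketch. This is plain Python, not Spark code: the function names `hash_partition`, `count_partition` and `word_count` are hypothetical, and the partitions are processed one after another here, whereas a cluster would process them in parallel on different nodes. It only shows why grouping equal keys into the same partition keeps the final merge cheap.

```python
from collections import Counter


def hash_partition(words, num_partitions):
    """Assign each word to a partition using a deterministic hash of the word."""
    partitions = [[] for _ in range(num_partitions)]
    for word in words:
        # Sum of code points is a simple, repeatable stand-in for a real hash.
        index = sum(map(ord, word)) % num_partitions
        partitions[index].append(word)
    return partitions


def count_partition(partition):
    """Count words inside one partition (the work one node would do locally)."""
    return Counter(partition)


def word_count(words, num_partitions=3):
    """Count words by partitioning first, counting locally, then merging."""
    partitions = hash_partition(words, num_partitions)
    # Assumption: on a cluster these local counts would run in parallel.
    partials = [count_partition(p) for p in partitions]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


counts = word_count(["a", "b", "a", "c", "b", "a"])
```

Because every occurrence of a word lands in the same partition, each partial count is already final for its keys, and the merge step only has to combine small summaries rather than move the raw data around.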
Spark performs better than Hadoop across a wide range of conditions:
1. Data sizes ranging from GBs to PBs.
2. Varying algorithmic complexity, from ETL to SQL to machine learning.
3. Workloads ranging from low-latency streaming jobs to long batch jobs.
4. Data on any storage medium, be it disks, SSDs, or memory.

Similarities Between Spark and Hadoop:
Let us look at how using both together can be better than siding with any one technology.

Hadoop components can be used alongside Spark in the following ways:
1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
3. YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
4. Batch and Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.

Apache Spark is an in-memory data processing engine. It provides a robust, flexible and user-friendly platform for batch processing, stream processing, machine learning and large-scale SQL.
Apache Spark was designed largely with data scientists in mind. It provides high-level APIs in many languages, for example Java, Scala, Python and R, and it has an optimized engine for executing computation graphs. It is one of the largest open-source projects in data processing. The key feature of Spark is in-memory processing. This is not a new computing concept: there is a long list of database and data processing products built on an in-memory design. In-memory processing increases processing speed.

There is no need to fetch data from the disk again and again, so time is saved. Apache Spark has a DAG computation engine that supports in-memory computation and acyclic data flow, which results in high speed. Spark provides APIs for three types of datasets: RDDs, DataFrames and Datasets. Resilient Distributed Datasets (RDDs) are immutable distributed collections of data that can be manipulated using functional transformations.
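To make the idea of an immutable collection manipulated by functional transformations concrete, here is a minimal sketch in plain Python. `ToyRDD` is a hypothetical class written for illustration; it is not Spark's RDD, it lives on one machine, and it evaluates eagerly. Only the method names `map`, `filter` and `collect` are borrowed from the RDD API.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an immutable collection."""

    def __init__(self, items):
        # A tuple cannot be modified in place, which models immutability.
        self._items = tuple(items)

    def map(self, fn):
        """Return a NEW ToyRDD with fn applied to every element."""
        return ToyRDD(fn(x) for x in self._items)

    def filter(self, predicate):
        """Return a NEW ToyRDD keeping only elements that satisfy predicate."""
        return ToyRDD(x for x in self._items if predicate(x))

    def collect(self):
        """Return the elements as a regular list."""
        return list(self._items)


numbers = ToyRDD([1, 2, 3, 4, 5])
# Each transformation returns a new collection; `numbers` is never changed.
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
```

Calling `even_squares.collect()` gives the even squares, while `numbers.collect()` still returns the original five values, because no transformation alters the collection it was called on.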

The changes applied to data in Spark are made through compositional functional transformations. This approach to formulating and resolving data processing problems is favored by many data scientists. Apache Spark is known for its effective use of CPU cores over many server nodes. Along with Standalone Cluster Mode, Spark also supports other cluster managers, including Hadoop YARN and Apache Mesos.

Being a distributed computing framework, Spark needs robust cluster management functionality. Spark also provides real-time stream processing. One problem with Hadoop MapReduce was that it could only process data that was already stored, not data arriving in real time. Spark Streaming solves this problem for data scientists. Lazy evaluation in Apache Spark helps increase the system's efficiency. Transformations on Spark's RDDs are lazy: they do not produce results right away. Instead, each transformation defines a new RDD from an existing one, and the work is carried out only when a result is actually needed. Apache Spark also aims to prevent data loss, providing fault tolerance through its core abstraction, the RDD.
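The lazy behaviour described above can be sketched with a toy model in plain Python. `LazyDataset`, `load_numbers` and `load_calls` are hypothetical names invented for this illustration and are not part of Spark. The sketch assumes that a dataset remembers the ordered list of transformations that produced it; with that record, results can be recomputed from the source, which is one way to think about how an RDD can recover lost data.

```python
class LazyDataset:
    """Hypothetical toy model: records transformations, runs them on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source            # function that (re)loads the raw data
        self._lineage = tuple(lineage)   # ordered transformations recorded so far

    def map(self, fn):
        # Nothing is computed here; the step is only recorded.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Nothing is computed here either.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        """Load the source and replay every recorded transformation in order."""
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


load_calls = []


def load_numbers():
    """Pretend data source that records how often it is read."""
    load_calls.append(1)
    return range(1, 6)


pipeline = LazyDataset(load_numbers).map(lambda x: x * 10).filter(lambda x: x > 20)
calls_before_action = len(load_calls)  # the source has not been read yet
first_result = pipeline.collect()      # the recorded steps run now
recomputed = pipeline.collect()        # replaying the same steps gives the same data
```

Building `pipeline` reads nothing from the source; the data is loaded only when `collect()` is called. The second `collect()` reproduces the same result purely from the recorded steps, which is the property that makes recomputation after a failure possible in this toy model.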

Source: HOB