Introduction to Apache Spark
Apache Spark is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. It provides in-memory computation for increased speed and faster data processing than MapReduce. It runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS); it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
Is Apache Spark going to replace Hadoop?
Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark was designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, one that can be used for real-time stream processing and for fast interactive queries that finish within seconds. So Hadoop supports both traditional map/reduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models, and at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.
Hadoop MapReduce vs. Spark – Which One to Choose?
Because Spark uses RAM instead of network and disk I/O, it is relatively fast compared to Hadoop. But since it relies on large amounts of RAM, it needs dedicated high-end physical machines to produce effective results. The right choice depends on your workload, and the variables on which this decision depends keep changing over time.
Difference between Hadoop MapReduce and Apache Spark
Spark stores data in memory, whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network I/O.
From the Spark academic paper: “RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition.” This removes the need for replication to achieve fault tolerance.
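The lineage idea can be sketched in plain Python. This is a toy illustration of the concept, not Spark's actual API: each partition remembers the input data and the transformation that produced it, so a lost partition can simply be recomputed instead of restored from a replica.

```python
# Toy sketch of RDD-style lineage-based recovery (illustrative only,
# not Spark's real classes). Each partition stores its parent data and
# the transformation that derived it, rather than a replicated copy.

class LineagePartition:
    def __init__(self, parent_data, transform):
        self.parent_data = parent_data  # input records for this partition
        self.transform = transform      # function used to derive it
        self.data = [transform(x) for x in parent_data]

    def lose(self):
        # Simulate losing the computed partition (e.g. a node failure).
        self.data = None

    def recover(self):
        # Rebuild just this partition from its lineage -- no replica needed.
        self.data = [self.transform(x) for x in self.parent_data]

part = LineagePartition([1, 2, 3], lambda x: x * 10)
part.lose()       # partition contents are gone
part.recover()    # recomputed from parent data + transform
print(part.data)  # -> [10, 20, 30]
```

The point is that only the lost partition is recomputed, which is why lineage avoids the network and storage cost of keeping full replicas.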
Do I need to learn Hadoop first to learn Apache Spark?
No, you don’t need to learn Hadoop to learn Spark. Spark began as an independent project. But after YARN and Hadoop 2.0, Spark became popular because it can run on top of HDFS alongside other Hadoop components. Spark has become another data processing engine in the Hadoop ecosystem, which is good for businesses and the community alike, as it adds capability to the Hadoop stack.
For developers, there is almost no overlap between the two. Hadoop is a framework in which you write MapReduce jobs by inheriting Java classes. Spark is a library that enables parallel computation via function calls.
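The contrast in programming style can be sketched with a hypothetical mini-API (for illustration only; PySpark's real RDD API is similar in spirit, e.g. `rdd.map(...).filter(...).reduce(...)`): instead of subclassing Mapper and Reducer classes, you express the whole pipeline as chained function calls.

```python
# Toy Spark-like pipeline built from chained function calls.
# MiniRDD is a hypothetical stand-in, not a real Spark class.
from functools import reduce as _reduce

class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return _reduce(f, self.data)

total = (MiniRDD(range(10))
         .map(lambda x: x * x)          # square each element
         .filter(lambda x: x % 2 == 0)  # keep the even squares
         .reduce(lambda a, b: a + b))   # sum them up
print(total)  # -> 120
```

In real Spark the transformations (`map`, `filter`) are lazy and distributed across the cluster, but the developer-facing style is the same: composing ordinary functions rather than writing job classes.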
For operators running a cluster, there is an overlap in general skills, such as monitoring, configuration, and code deployment.
Why should you learn Apache Spark?
Spark’s enterprise adoption is rising because of its potential to eclipse Hadoop: it is the leading alternative to MapReduce, within the Hadoop framework or outside it. Like Hadoop, Apache Spark requires technical expertise in object-oriented programming concepts to program and run, thus opening up job opportunities for those with hands-on Spark experience. An industry-wide Spark skills shortage is creating a number of open jobs and contracting opportunities for big data professionals.
For people who want a career at the forefront of big data technology, learning Apache Spark now will open up many opportunities. There are several ways to bridge the skills gap for data-related jobs and to find a position as a Spark developer. The best way is formal training, such as that provided by Tek Classes, which offers hands-on working experience and learning through practical projects.