With MapReduce having clocked a decade since its introduction, and newer big data frameworks emerging, let's do a code comparison between Hadoop MapReduce and Apache Spark, a general-purpose compute engine for both batch and streaming data.
We begin with the hello world program of the big data world, a.k.a. wordcount, on Mark Twain's collected works dataset.
Problem Statement:
- Count the occurrences of each word and sort the counts in descending order
- Remove special characters and the endings "'s", "ly", "ed", "ing", and "ness", and convert all words to lowercase
Solution:
We look at both the MapReduce and Spark code to see how the problem statement can be solved using each framework.
MapReduce:
WordCountMapper.java:
Using TextInputFormat, which supplies the byte offset as the key and the record as the value, we create the WordCountMapper class. It removes all special characters using the regex [^\\w\\s], strips the endings "'s", "ly", "ed", "ing", and "ness" using the regex ('s|ly|ed|ing|ness), and converts all words to lowercase. The mapper then emits (word, 1) as the K,V pair.
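Below is a minimal sketch of WordCountMapper.java, assuming the standard new-API Mapper and that the endings are removed at word boundaries; the exact regex handling in the original code may differ slightly.

package com.stdatalabs.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: (byte offset, line) -> (word, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Strip the listed endings at word boundaries (assumed), replace remaining
    // special characters with spaces, and lowercase everything
    String line = value.toString()
        .replaceAll("('s|ly|ed|ing|ness)\\b", "")
        .replaceAll("[^\\w\\s]", " ")
        .toLowerCase();
    for (String token : line.split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }
}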
WordCountDriver.java:
We create two Job objects. The first job, wordcount, performs the word count on the dataset, and the second job, sort, orders the output of the first job in descending order of count.
The driver takes three arguments: the source directory, the target directory for the first MapReduce job, and the target directory for the second MapReduce job. The output of the first job is used as the input to the second. The rest of the code is self-explanatory.
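A minimal sketch of WordCountDriver.java chaining the two jobs is shown below. The nested WordCountReducer (sums the 1s), SortMapper (swaps each word/count pair to (count, word)), and DescendingIntComparator (reverses the IntWritable order) classes are assumptions filled in for illustration and are not part of the original post.

package com.stdatalabs.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

  // Assumed reducer for job 1: sums the 1s emitted by WordCountMapper
  public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Assumed mapper for job 2: swaps each "word<TAB>count" line to (count, word)
  public static class SortMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        context.write(new IntWritable(Integer.parseInt(fields[1])), new Text(fields[0]));
      }
    }
  }

  // Assumed comparator for job 2: reverses the natural IntWritable order
  public static class DescendingIntComparator extends WritableComparator {
    protected DescendingIntComparator() {
      super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return -super.compare(a, b);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Job 1: wordcount reads the source directory and writes (word, count) pairs
    Job wordcount = Job.getInstance(conf, "wordcount");
    wordcount.setJarByClass(WordCountDriver.class);
    wordcount.setMapperClass(WordCountMapper.class);
    wordcount.setReducerClass(WordCountReducer.class);
    wordcount.setOutputKeyClass(Text.class);
    wordcount.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(wordcount, new Path(args[0]));
    FileOutputFormat.setOutputPath(wordcount, new Path(args[1]));
    if (!wordcount.waitForCompletion(true)) {
      System.exit(1);
    }

    // Job 2: sort reads job 1's output and orders it by count, descending
    Job sort = Job.getInstance(conf, "sort");
    sort.setJarByClass(WordCountDriver.class);
    sort.setMapperClass(SortMapper.class);
    sort.setSortComparatorClass(DescendingIntComparator.class);
    sort.setNumReduceTasks(1); // a single reducer yields one globally sorted output file
    sort.setOutputKeyClass(IntWritable.class);
    sort.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(sort, new Path(args[1]));
    FileOutputFormat.setOutputPath(sort, new Path(args[2]));
    System.exit(sort.waitForCompletion(true) ? 0 : 1);
  }
}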
Run:
Submit the MapReduce job, passing the source, intermediate output directory, and final output directory as arguments:
yarn jar wordcount-0.0.1-SNAPSHOT.jar com.stdatalabs.mapreduce.wordcount.WordCountDriver MarkTwain.txt mrWordCount mrWordCount_sorted
Output:
Spark:
As can be seen above, MapReduce restricts all the logic to mappers and reducers, and we end up writing a lot of boilerplate code rather than the actual data processing logic. This results in more lines of code for a simple use case like wordcount, which brings us to Apache Spark.
Spark provides a more flexible approach using RDDs, which are efficient for iterative algorithms because the data can be cached in memory after a single read instead of being read from disk repeatedly. Spark also provides many more transformations and actions than just map and reduce.
Driver.scala
The entire wordcount logic can be written in one Scala class. The key steps are listed below, followed by a sketch of the full class.
- val file = sc.textFile(args(0)) - The dataset is read using the SparkContext
- val lines = file.map(line => line.replaceAll("[^\\w\\s]|('s|ly|ed|ing|ness) ", " ").toLowerCase()) - Removes all the special characters and endings using regexes similar to the MapReduce code
- val wcRDD = lines.flatMap(line => line.split(" ").filter(_.nonEmpty)).map(word => (word, 1)).reduceByKey(_ + _) - Splits each record on spaces, maps each word to the value 1, and aggregates the values for each key
- val sortedRDD = wcRDD.sortBy(_._2, false) - Sorts by value in descending order
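Putting the snippets together, here is a minimal sketch of Driver.scala. The SparkConf/SparkContext setup and the final saveAsTextFile call to the target directory are assumptions based on the spark-submit arguments shown below.

package com.stdatalabs.SparkWordcount

import org.apache.spark.{SparkConf, SparkContext}

object Driver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkWordcount")
    val sc = new SparkContext(conf)

    // Read the dataset from the source path
    val file = sc.textFile(args(0))

    // Strip special characters and the listed endings, lowercase everything
    val lines = file.map(line =>
      line.replaceAll("[^\\w\\s]|('s|ly|ed|ing|ness) ", " ").toLowerCase())

    // Split into words, pair each with 1, and sum the counts per word
    val wcRDD = lines
      .flatMap(line => line.split(" ").filter(_.nonEmpty))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Order by count, descending, and write to the target directory (assumed)
    val sortedRDD = wcRDD.sortBy(_._2, false)
    sortedRDD.saveAsTextFile(args(1))

    sc.stop()
  }
}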
Run:
Submit the Spark application, passing the source and target directories as arguments:
spark-submit --class com.stdatalabs.SparkWordcount.Driver --master yarn-client SparkWordcount-0.0.1-SNAPSHOT.jar /user/cloudera/MarkTwain.txt /user/cloudera/sparkWordCount
Output: