MapReduce

Hadoop MapReduce spreads computation across many computers with little communication between them, while handling stragglers and failures automatically. This section covers the core MapReduce concepts with some examples; PDF guides on Hadoop MapReduce are provided at the end of the section. Building on ideas from functional programming, MapReduce programs are designed to evaluate bulk volumes of data in parallel by dividing the work across a large number of machines. A model of this kind could not scale to large clusters of hundreds or thousands of nodes if its components had to keep the data on all nodes synchronized at all times; that requirement would prevent the system from operating reliably or efficiently at large scale. Instead, in MapReduce, data elements cannot be updated.

That is, in MapReduce all data elements are immutable. If a mapping task changes an input (key, value) pair, the change is not reflected back in the input files; communication happens only by generating new output (key, value) pairs, which the Hadoop system then forwards into the next phase of execution.

A MapReduce program converts lists of input data elements into lists of output data elements. It does this in three phases:

  1. Map Phase
  2. Sort Phase
  3. Reduce Phase

Map Phase

The first phase of a MapReduce program is known as the mapping phase. It uses a function called the Mapper. A list of data elements is fed, one element at a time, into the Mapper, which transforms each element into an output data element.

[Figure: Map phase key-value pairs]

The map phase specifies operations only on key-value pairs.
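To make this concrete, here is a sketch of what a word-count mapper might look like using Hadoop's older org.apache.hadoop.mapred API. The driver shown later in this section refers to a mapper class named MapClass, which is the name assumed here; it is shown as a standalone class for brevity. For every word in an input line it emits a (word, 1) pair.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a word-count mapper: input is (byte offset, line of text),
// output is one (word, 1) pair per word in the line.
public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);   // emit (word, 1)
    }
  }
}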

Sort Phase

Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before it is presented to the Reducer. This phase is optional.

(key1, value289)
(key1, value43)
(key1, value3)
...
(key2, value512)
(key2, value11)
(key2, value67)

Example: word count

Reduce Phase

Reducing is the process of aggregating values together. The reducer function receives an iterator of input values from an input list and combines them to return a single output value.
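As a concrete sketch, here is what a word-count reducer might look like in Hadoop's older org.apache.hadoop.mapred API. The driver later in this section refers to a reducer class named Reduce, which is the name assumed here. It iterates over the counts emitted for one word and outputs their sum.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a word-count reducer: input is (word, list of counts),
// output is a single (word, total count) pair.
public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // add up all counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}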


Example: word count

Pseudocode for word count in MapReduce:

# Emit (word, 1) for every word in the line.
def mapper(key, value):
    for word in value.split():
        yield word, 1

# Sum all the counts emitted for a word.
def reducer(key, values):
    yield key, sum(values)

The Hadoop MapReduce framework takes these ideas and uses them to process large volumes of information. A MapReduce program has two components:

  1. Mapper
  2. Reducer

In Hadoop, these two components are slightly extended from the basic functional-programming principles described above.

Keys and values


In MapReduce, no value stands on its own. Every value has a key associated with it, and these keys identify related values.

Example: a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number; it would look like this:

AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm

The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow.

MapReduce is also less strict than other frameworks about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase: a mapper may map one input into zero, one, or a hundred outputs, and a reducer may compute over an input list and emit one or a dozen different outputs.
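To illustrate this flexibility, here is a hypothetical mapper sketch for the speedometer log shown earlier: it parses each line and emits a (license-plate, reading) pair only when the speed exceeds a threshold, so any given input line may produce zero or one output pairs. The class name, line format, and threshold are assumptions for illustration only.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: keeps only readings above 55 mph.
// Assumed input line format: "AAA-123   65mph, 12:00pm"
public class SpeedingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().trim().split("\\s+", 2);
    if (parts.length < 2) {
      return;                        // malformed line: emit nothing
    }
    String plate = parts[0];         // e.g. "AAA-123"
    String reading = parts[1];       // e.g. "65mph, 12:00pm"
    int speed = Integer.parseInt(reading.split("mph")[0].trim());
    if (speed > 55) {
      // zero or one output pairs per input line
      output.collect(new Text(plate), new Text(reading));
    }
  }
}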

Keys divide the reduce space

In MapReduce, all of the output values are usually not reduced together. All of the values with the same key are presented to a single reducer together, and this reduce operation runs independently of any reduce operations occurring on lists of values with different keys.
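Which reducer handles which key is decided by a partitioner. By default Hadoop uses hash partitioning, which in the older API behaves roughly like the sketch below: every (key, value) pair is routed to a reduce task based only on the key's hash code, so all values sharing a key end up at the same reducer.

// Sketch of the default hash-partitioning behaviour (method fragment only):
// keys with the same hash modulo the number of reduce tasks go to the same reducer.
public int getPartition(Text key, IntWritable value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}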


The Driver Method

The final component of a Hadoop MapReduce program is called the driver. The driver initializes the job, instructs the Hadoop platform to run our code on a set of input files, and controls where the output files are placed.

Example: a cleaned-up version of the driver

public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}

This method sets up a job to execute the word count program across all the files in a given input directory (the inputPath argument). The output from the reducers is written into files in the directory identified by outputPath. The configuration information to run the job is captured in the JobConf object. The mapping and reducing functions are identified by the setMapperClass() and setReducerClass() methods. The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
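For example, in a hypothetical job whose mapper emitted (Text, Text) pairs while its reducer emitted (Text, IntWritable) pairs, the driver would need to state the map output types explicitly; a sketch:

// Reducer (and final job) output types
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

// Mapper output types, needed only because they differ from the above
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);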

The input types fed to the mapper are controlled by the InputFormat used. The default input format, “TextInputFormat,” will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
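A different input format can be selected on the JobConf. As a sketch, using KeyValueTextInputFormat from the older org.apache.hadoop.mapred package would split each line on a tab character and present it to the mapper as a (Text, Text) pair instead:

// Sketch: read each tab-separated line as a (Text key, Text value) pair
// instead of the default (LongWritable offset, Text line) pair.
conf.setInputFormat(KeyValueTextInputFormat.class);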

The call to JobClient.runJob(conf) will submit the job to MapReduce. This call will block until the job completes. If the job fails, it will throw an IOException. JobClient also provides a non-blocking version called submitJob().
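A sketch of the non-blocking style, assuming it runs inside a method that declares throws Exception (like the run() method above) and that we poll the returned RunningJob handle until the job finishes:

// Submit without blocking, then poll for completion.
JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);   // returns immediately
while (!job.isComplete()) {
  Thread.sleep(5000);                      // wait five seconds between polls
}
if (!job.isSuccessful()) {
  throw new IOException("Word count job failed");
}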

MapReduce Data Flow

The figure below shows how these components work together at a higher level.

[Figure: MapReduce data flow]

MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS, and these files are evenly distributed across all our nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in our cluster. Each of these mapping tasks is equivalent: no mappers have particular “identities” associated with them. Therefore, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.

When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange information with one another, nor are they aware of one another’s existence. Similarly, different reduce tasks do not communicate with one another.

The user never explicitly moves information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce’s reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.