05 MapReduce

What is MapReduce in Hadoop?

MapReduce is a programming model for processing very large datasets. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, Python, and C++. MapReduce programs are parallel by nature, which makes them well suited to large-scale data analysis across multiple machines in a cluster.

MapReduce programs work in two phases:

  1. Map phase
  2. Reduce phase

The input to each phase is a set of key-value pairs. In addition, the programmer needs to specify two functions: a map function and a reduce function.
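To make the shape of these two functions concrete, here is a minimal sketch in plain Python. The names map_fn and reduce_fn are hypothetical and are not part of any Hadoop API; the word-count logic is only an assumption chosen to match the example used later in this tutorial.

```python
from typing import Iterable, Iterator, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input (key, value) pair and emit intermediate pairs.

    Here the key is a line offset (unused) and the value is a line of text.
    """
    for word in value.split():
        yield (word, 1)


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: take one key with all of its values and return one output pair."""
    return (key, sum(values))


mapped = list(map_fn(0, "Hadoop is good"))
reduced = reduce_fn("Hadoop", [1, 1, 1])
print(mapped)
print(reduced)
```

Both functions consume and produce key-value pairs, which is what lets the framework connect the output of one phase to the input of the next.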

How Does MapReduce Work? Complete Process

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

Let's understand this with an example.

Suppose you have the following input data for your MapReduce program:

Welcome to Hadoop Class Hadoop is good Hadoop is bad

[Figure: MapReduce Architecture]

The final output of the MapReduce task is:

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1
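The result above can be reproduced with a small, single-machine sketch in plain Python. This is not Hadoop code. The division of the input into three splits is an assumption, since the original figure is not available here, and the final case-insensitive sort is only used to match the ordering of the listing above.

```python
from collections import defaultdict

# Assumed input splits (the way the input is divided is an illustration only).
splits = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

# Mapping: emit a (word, 1) pair for every word in every split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffling: group all values that belong to the same word.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reducing: sum the grouped values to get one total per word.
totals = {word: sum(counts) for word, counts in grouped.items()}

# Print in case-insensitive alphabetical order, as in the listing above.
for word in sorted(totals, key=str.lower):
    print(word, totals[word])
```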

The data goes through the following phases:

Input Splits

The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
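As a rough illustration of splitting, the sketch below cuts a list of records into fixed-size pieces. The helper make_splits and its records_per_split parameter are hypothetical. Counting records instead of bytes is a simplifying assumption made for readability.

```python
from typing import List


def make_splits(records: List[str], records_per_split: int) -> List[List[str]]:
    """Divide the input into consecutive pieces of at most records_per_split records."""
    return [
        records[i:i + records_per_split]
        for i in range(0, len(records), records_per_split)
    ]


# Assumed input records, one line of text each.
records = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

splits = make_splits(records, records_per_split=1)
for number, split in enumerate(splits):
    print("split", number, "->", split)
```

Each resulting split would then be handed to exactly one map.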

Mapping

This is the first processing phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function, which produces output values. In our example, the job of the mapping phase is to count the occurrences of each word in the input splits (described above) and prepare a list of (word, frequency) pairs.
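The mapping step for the word-count example could look like the sketch below. The function name map_split is hypothetical, the three splits are assumed, and emitting a frequency of 1 per occurrence (rather than pre-summing inside the map) is one common way to write it, not necessarily the one shown in the original figure.

```python
from typing import List, Tuple


def map_split(split: str) -> List[Tuple[str, int]]:
    """Turn one input split into a list of (word, frequency) pairs.

    Every occurrence of a word is emitted with a frequency of 1.
    """
    return [(word, 1) for word in split.split()]


# Assumed input splits; each one is processed by its own call to map_split.
splits = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

map_outputs = [map_split(split) for split in splits]
for split, output in zip(splits, map_outputs):
    print(repr(split), "->", output)
```

Because each call looks at only one split, the calls do not depend on each other and can run in parallel.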

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, occurrences of the same word are grouped together along with their respective frequencies.
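The grouping performed by shuffling can be sketched as follows. The function name shuffle is hypothetical and the mapped pairs are assumed to come from a word-count map step. Sorting the keys is included only to give a predictable order in this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def shuffle(mapped: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the values of all pairs that share the same key."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Return a plain dict with keys in sorted order.
    return dict(sorted(grouped.items()))


# Assumed output of the Mapping phase for the example input.
mapped = [
    ("Welcome", 1), ("to", 1), ("Hadoop", 1), ("Class", 1),
    ("Hadoop", 1), ("is", 1), ("good", 1),
    ("Hadoop", 1), ("is", 1), ("bad", 1),
]

shuffled = shuffle(mapped)
for word, frequencies in shuffled.items():
    print(word, frequencies)
```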

Reducing

In this phase, the output values from the Shuffling phase are aggregated: for each key, the values are combined into a single output value. In short, this phase summarizes the complete dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates the total occurrences of each word.
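A minimal sketch of the reducing step is shown below. The function name reduce_word is hypothetical and the grouped input is assumed to be what a shuffle step would produce for the example data.

```python
from typing import Dict, List, Tuple


def reduce_word(word: str, frequencies: List[int]) -> Tuple[str, int]:
    """Combine all frequencies recorded for one word into a single total."""
    return (word, sum(frequencies))


# Assumed output of the Shuffling phase for the example input.
grouped: Dict[str, List[int]] = {
    "Welcome": [1],
    "to": [1],
    "Hadoop": [1, 1, 1],
    "Class": [1],
    "is": [1, 1],
    "good": [1],
    "bad": [1],
}

results = dict(reduce_word(word, freqs) for word, freqs in grouped.items())
for word, total in results.items():
    print(word, total)
```

Each key is reduced independently of the others, so the reduce calls for different words could also be spread over several machines.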

MapReduce Architecture explained in detail

How Does MapReduce Organize Work?

Hadoop divides the job into tasks. As mentioned above, there are two types of tasks:

  1. Map tasks (Splits & Mapping)
  2. Reduce tasks (Shuffling, Reducing)

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

  1. JobTracker: Acts like a master (responsible for the complete execution of the submitted job)
  2. Multiple TaskTrackers: Act like slaves, each of them performing part of the job

For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.
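The master/worker relationship described above can be imitated on a single machine with a thread pool. This is only a toy model under stated assumptions: the functions job_tracker and run_map_task are hypothetical, threads stand in for separate machines, and everything a real cluster must handle (such as scheduling across nodes and recovering from failures) is left out.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_map_task(split: str) -> Counter:
    """Work done by one (hypothetical) task tracker: count the words in its split."""
    return Counter(split.split())


def job_tracker(splits: List[str], num_task_trackers: int = 2) -> Counter:
    """Toy master: create one task per split, hand the tasks to workers, merge results."""
    with ThreadPoolExecutor(max_workers=num_task_trackers) as pool:
        partial_counts = list(pool.map(run_map_task, splits))

    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)  # Counter.update adds the counts together
    return totals


# Assumed input splits for the example job.
splits = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

totals = job_tracker(splits)
print(dict(totals))
```

The master never counts words itself. It only hands out tasks and combines what the workers report back, which mirrors the division of responsibility between the JobTracker and the TaskTrackers.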
