MapReduce

From Wikipedia:

"MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary Big Data processing model,[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities."

MapReduce is a programming model designed specifically to read, process and write very large data volumes.

MapReduce implements the following features:

* Automatic parallelization of Hadoop programs
* Transparent management of distributed mode
* Fault tolerance

The MapReduce paradigm is based on sending the computation to the location of the data. Processing can occur on data stored either in a filesystem (unstructured data) or in a database (structured data). MapReduce takes advantage of data locality, processing data near the place where it is stored.

A MapReduce program is composed of three stages:

* Map stage (mapper's job): Each worker node applies the "map()" function to its local data and writes the output to local temporary storage.
* Shuffle stage: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
* Reduce stage: Worker nodes process each group of output data from the shuffle stage, per key, in parallel, producing a new set of output that is stored in HDFS.

The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

Most of the computing takes place on nodes holding the data on their local disks, which reduces network traffic.

The user defines:

a. the <key, value> pairs
b. the mapper & reducer functions

Hadoop handles the logistics.
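To make this division of labor concrete, here is a minimal single-process sketch of the model in Python (the function names are our own, and a real cluster distributes these stages across nodes): the user supplies only the mapper and reducer functions, while the framework performs the shuffle.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process sketch of the MapReduce model.

    In real Hadoop the three stages below run distributed across
    worker nodes; here everything happens in one process.
    """
    # Map stage: apply the user's mapper to every input record.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle stage: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: apply the user's reducer to each key group.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: maximum value observed per key.
data = [("a", 3), ("b", 5), ("a", 7)]
print(map_reduce(data,
                 mapper=lambda kv: [kv],                  # identity map
                 reducer=lambda key, values: max(values)))
# {'a': 7, 'b': 5}
```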

MapReduce with two reducers:

(figure: ArchitectureHadoop)

Word Count Example

Word count is the typical first hands-on example for Hadoop MapReduce developers. This sample MapReduce job counts the number of occurrences of each word in the provided input files.

The word count operation is implemented in three steps:

* map step
* shuffle step
* reduce step

In the map step, the text is tokenized into words. We form a key-value pair for each word, where the key is the word itself and the value is '1'.

For example, consider the sentence:

> “well done well done nice job done”

During the map step, the sentence is split into words, forming the initial key-value pairs:

<done,1>
<well,1>

<done,1>
<well,1>

<done,1>
<job,1>
<nice,1>

During the shuffle-and-sort step, keys (words) are grouped together and sorted alphabetically. Each key's value becomes a list containing one '1' per occurrence of the word. The key-value pairs are rendered as:

<done,[1,1,1]>
<job,1>
<nice,1>
<well,[1,1]>

After the reduce step, the key-value pairs give the number of occurrences of each word in the input. The reduce step consists of an aggregation process over the values of each key.

<done,3>
<job,1>
<nice,1>
<well,2>
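These three steps translate directly into code. Below is a sketch in the style of Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as a mapper or reducer (the file names mapper.py and reducer.py are our own choice). Hadoop performs the shuffle-and-sort between the two scripts, so the reducer sees its input sorted by key.

```python
# mapper.py -- emits <word, 1> for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts for each word. The shuffle-and-sort
# step guarantees the input is sorted by key, so all lines for one
# word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested locally with a shell pipeline that emulates the shuffle: `cat input.txt | python mapper.py | sort | python reducer.py`.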

(figure: ArchitectureHadoop)

Hadoop 1.x: MapReduce v1 (MRv1) Architecture

JobTracker and TaskTracker are the two essential processes involved in MapReduce execution in Hadoop 1.x. Both processes are now deprecated and have been replaced by the ResourceManager and NodeManager daemons.

JobTracker

The JobTracker process runs on a separate master node in the Hadoop cluster, as it coordinates the execution of all MapReduce applications in the cluster.

It receives MapReduce processing requests from clients and asks the NameNode for the location of the data.

It finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the slots available to execute a task on a given node.

It monitors the individual TaskTrackers and receives status updates from the TaskTracker nodes.

If the JobTracker node goes down, HDFS remains functional, but existing MapReduce jobs are halted.

TaskTracker

An instance of the TaskTracker daemon runs on every slave node (DataNode).

Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.

The JobTracker assigns Mapper and Reducer tasks to TaskTracker nodes.

TaskTrackers manage the processing resources on each slave node in the form of processing slots.

The total number of map and reduce slots indicates how many map and reduce tasks can be executed at one time on the slave node.

When a TaskTracker becomes unresponsive, the JobTracker assigns its tasks to another node.

In Hadoop 1.x, MapReduce provides both resource management (through the JobTracker and TaskTrackers) and data processing.

(figure: ArchitectureHadoop)

Scheduler

The scheduler:

* Is responsible for allocating resources to the various running applications.
* Performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource Container, which incorporates memory, CPU, disk, network, etc.

Job scheduling in Hadoop is performed by a master, which manages a number of slaves. Each slave has a fixed number of map slots and reduce slots in which it can run tasks. Typically, administrators set the number of slots to one or two per core. The master assigns tasks in response to heartbeats sent by slaves every few seconds, which report the number of free map and reduce slots on the slave.
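Below is a toy illustration of this heartbeat protocol (the names are invented, and data locality, which the real JobTracker also weighs, is ignored): each heartbeat reports the slave's free slots, and the master fills them from its queues of pending tasks.

```python
from collections import deque

# Pending tasks, split by type (toy data).
pending_map_tasks = deque(["m1", "m2", "m3", "m4"])
pending_reduce_tasks = deque(["r1", "r2"])

def on_heartbeat(slave, free_map_slots, free_reduce_slots):
    """Master-side handling of one heartbeat: fill each reported
    free slot with a pending task of the matching type."""
    assignments = []
    for _ in range(free_map_slots):
        if not pending_map_tasks:
            break
        assignments.append((slave, "map", pending_map_tasks.popleft()))
    for _ in range(free_reduce_slots):
        if not pending_reduce_tasks:
            break
        assignments.append((slave, "reduce", pending_reduce_tasks.popleft()))
    return assignments

# A slave with 2 free map slots and 1 free reduce slot reports in:
print(on_heartbeat("slave-01", 2, 1))
# [('slave-01', 'map', 'm1'), ('slave-01', 'map', 'm2'),
#  ('slave-01', 'reduce', 'r1')]
```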

The first Hadoop scheduler, the FIFO (First In, First Out) Scheduler, has no policy for preempting running jobs. Two other schedulers were implemented to overcome the FIFO Scheduler's limitations: the Fair Scheduler (developed by Facebook) and the Capacity Scheduler (developed by Yahoo). They target similar use cases, and both support:

* hierarchical queues: All queues descend from a queue named "root". Available resources are distributed among the children of the root queue, and the children in turn distribute the resources assigned to them to their own children.
* preemption: Jobs that have high priority, or that have been waiting a long time to run, are allowed to run.
* delay scheduling
* multi-resource scheduling

The Fair Scheduler is the default scheduler in some Hadoop distributions (e.g. CDH).

Fair scheduling is a method of assigning resources to jobs such that all jobs get an equal share of resources over time.

When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs.

Jobs are organized into pools, and resources are shared fairly between these pools. By default, each user gets their own pool. For each pool, a minimum number of map slots and reduce slots can be defined, as well as a maximum number of jobs that can run simultaneously.
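The notion of an equal share can be made concrete with a max-min ("water-filling") computation, sketched below. This is a toy model of how fair shares could be derived from pool demands, not Hadoop's actual implementation: pools that demand less than the equal share keep their demand, and the surplus is redistributed among the others.

```python
def fair_shares(capacity, demands):
    """Max-min fair allocation of `capacity` slots among pools."""
    shares = {}
    remaining = dict(demands)
    while remaining and capacity > 0:
        equal = capacity / len(remaining)
        # Pools satisfied by the equal share get exactly their demand.
        satisfied = {p: d for p, d in remaining.items() if d <= equal}
        if not satisfied:
            # Every remaining pool wants more than the equal share:
            # split the remaining capacity evenly.
            for p in remaining:
                shares[p] = equal
            return shares
        for p, d in satisfied.items():
            shares[p] = d
            capacity -= d
            del remaining[p]
    shares.update({p: 0 for p in remaining})
    return shares

# 100 slots shared by three pools:
print(fair_shares(100, {"alice": 20, "bob": 80, "batch": 60}))
# {'alice': 20, 'bob': 40.0, 'batch': 40.0}
```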


(figure: ArchitectureHadoop)

Hadoop 2.x: YARN

Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.

YARN: Next Generation of MapReduce

The main idea of MRv2 is to split up the two major functionalities of the JobTracker: resource management and job scheduling/monitoring.

Version 2 of Hadoop allows MapReduce jobs and graph processing jobs to coexist in the same cluster.

The ResourceManager has two main components:

* Scheduler
* ApplicationsManager

ApplicationMaster, ResourceManager and NodeManager

1. The user submits an application request, passing the configuration for the ApplicationMaster to the ResourceManager.
2. The ResourceManager allocates a container for the ApplicationMaster on a node and tells the NodeManager in charge of that node to launch the ApplicationMaster container.
3. The ApplicationMaster registers back with the ResourceManager and asks for more containers to run its tasks.
4. The ResourceManager allocates the containers on different nodes in the cluster.
5. The ApplicationMaster talks directly to the NodeManagers on those nodes to launch the containers for the tasks.
6. The ApplicationMaster monitors the progress of the tasks.
7. When all the application's tasks are done, the ApplicationMaster unregisters itself from the ResourceManager.
8. The ResourceManager reclaims the containers previously allocated to the application.
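The sequence above can be turned into a small runnable simulation. All class and method names below are invented for illustration; the real YARN API lives in Java classes such as YarnClient and AMRMClient.

```python
# Toy simulation of the YARN application lifecycle (invented names).

class Container:
    def __init__(self, node, cid):
        self.node, self.cid = node, cid

class NodeManager:
    """Per-node agent that launches containers on request."""
    def __init__(self, name):
        self.name = name
    def launch(self, container, task):
        print(f"{self.name}: launching {task} in container {container.cid}")

class ResourceManager:
    """Cluster-wide allocator handing out containers across nodes."""
    def __init__(self, nodes):
        self.nodes, self.next_id = nodes, 0
    def allocate(self, n):
        # Round-robin placement; the real Scheduler also weighs
        # resource requirements and data locality.
        out = [Container(self.nodes[(self.next_id + i) % len(self.nodes)],
                         self.next_id + i) for i in range(n)]
        self.next_id += n
        return out

class ApplicationMaster:
    """Per-application master: requests containers, launches tasks."""
    def run(self, rm):
        print("AM: registering with the ResourceManager")   # step 3
        containers = rm.allocate(3)                          # step 4
        for i, c in enumerate(containers):                   # step 5
            c.node.launch(c, f"task-{i}")
        print("AM: all tasks done, unregistering")           # steps 6-7

# Steps 1-2: the RM allocates a container for the ApplicationMaster
# and the NodeManager on that node launches it.
rm = ResourceManager([NodeManager("nm-1"), NodeManager("nm-2")])
am_container = rm.allocate(1)[0]
print(f"RM: launching the ApplicationMaster on {am_container.node.name}")
ApplicationMaster().run(rm)
```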

NodeManager

The NodeManager is the per-node agent: it launches and monitors containers on its node and reports resource usage and node health to the ResourceManager.

Glossary

PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.

Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pairs.

NameNode - Node that manages the Hadoop Distributed File System (HDFS).

DataNode - Node where the data is stored in advance, before any processing takes place.

MasterNode - Node where JobTracker runs and which accepts job requests from clients.

SlaveNode - Node where the Map and Reduce programs run.

JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTrackers.

TaskTracker - Tracks its tasks and reports status to the JobTracker.

Job - An execution of a Mapper and a Reducer across a dataset.

Task - An execution of a Mapper or a Reducer on a slice of data.

Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.