PIG

This page is still under development.

How to process data with Apache Pig

Pig is a dataflow engine for Hadoop: a high-level scripting layer on top of MapReduce. Pig Latin scripts are used to process, analyze, and manipulate data files.

Through the User Defined Functions (UDF) facility in Pig you can have Pig invoke code in many languages, such as JRuby, Jython, and Java. Conversely, you can embed Pig scripts in other languages.
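As a sketch of the UDF facility, a Pig Latin script can register a Jython module and call a function from it; the module file `myfuncs.py`, the function `to_upper`, and the input schema here are assumptions for illustration:

```pig
-- Register a hypothetical Jython UDF module (myfuncs.py is an assumed file)
REGISTER 'myfuncs.py' USING jython AS myfuncs;

-- Load a tab-separated file (assumed schema: name, age)
people = LOAD 'people.tsv' AS (name:chararray, age:int);

-- Apply the hypothetical UDF to every record
upper_names = FOREACH people GENERATE myfuncs.to_upper(name), age;
```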

Pig excels at analyzing raw data. A good example of a Pig application is the ETL transaction model, which describes how a process extracts data from a source, transforms it according to a rule set, and then loads it into a datastore. Pig can load data from files, streams, or other sources using User Defined Functions (UDFs). Once it has the data, it can perform selection, iteration, and other transformations over it. Pig can then store the results in the Hadoop Distributed File System (HDFS).
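The extract-transform-load pattern described above might look like the following sketch; the file paths, delimiter, and schema are assumptions:

```pig
-- Extract: load raw log records from HDFS (path and schema are assumed)
raw = LOAD '/data/logs/clicks.csv' USING PigStorage(',')
      AS (user:chararray, url:chararray, bytes:int);

-- Transform: keep only large responses
big = FILTER raw BY bytes > 1024;

-- Load: store the result back into HDFS
STORE big INTO '/data/output/big_clicks' USING PigStorage(',');
```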

Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster. The Pig interpreter also performs optimizations to speed up execution on Hadoop.
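You can inspect the plans Pig generates for an alias with the EXPLAIN operator; the alias name `big` here is an assumption:

```pig
-- Print the logical, physical, and MapReduce plans for an alias
EXPLAIN big;
```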

STEP 1: USER SPECIFIES DATA FLOW

Pig data flow:

Load data …

Filter …

Group …

Sum …
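The four steps above could be sketched in Pig Latin as follows; the file name, schema, and field names are assumptions for illustration:

```pig
-- Load data: read a comma-separated sales file (assumed schema)
sales = LOAD 'sales.csv' USING PigStorage(',')
        AS (store:chararray, amount:double);

-- Filter: drop records with non-positive amounts
valid = FILTER sales BY amount > 0.0;

-- Group: collect records per store
by_store = GROUP valid BY store;

-- Sum: total the amounts within each group
totals = FOREACH by_store GENERATE group AS store, SUM(valid.amount) AS total;

DUMP totals;
```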