Hadoop PIG

Apache Pig is a platform for analyzing very large data sets. It provides a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs, and it lets users query Hadoop data much as they would query a SQL database. There are some PDF tutorials about Apache Pig at the end of this section. The structure of Pig programs is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

To get started, you will need the following:

  1. A Java SDK Installed
  2. Ant Installed
  3. Subversion
  4. A working installation of Hadoop

After that we can install Pig and test it out.

Pig compiles these data flow programs into sequences of MapReduce jobs and executes them using Hadoop. It is also possible to execute Pig Latin programs in a “local” mode (without a Hadoop cluster), in which case all processing takes place in a single local JVM.
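As a small sketch (the input file name is an assumption), Grunt's EXPLAIN command can be used to inspect the plans Pig generates for a data flow before it runs:

```pig
-- A word-count style flow; Pig compiles the whole chain before executing it.
A = LOAD 'input.txt' AS (word:chararray);
B = GROUP A BY word;
C = FOREACH B GENERATE group, COUNT(A);
EXPLAIN C;  -- prints the logical, physical, and MapReduce execution plans
```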

Pig is a high-level data flow language for processing very large files. It transforms sets of records, processing the data one step at a time, and for many tasks it is easier to write than the equivalent SQL. The Hadoop Pig components are:

  1. Pig Latin
  2. Grunt shell
  3. PigServer

Pig Latin

Pig’s infrastructure layer includes a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer consists of a textual language called Pig Latin, which has the following properties. A Pig Latin program describes a directed acyclic graph in which each node represents an operation that transforms data. Operations come in two styles:

  1. Relational-algebra-style operations such as JOIN, FILTER, and projection.
  2. Functional-programming-style operators such as map and reduce.
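A few lines of Pig Latin illustrate such a graph; each alias below is a node in the data flow, and the file names are hypothetical:

```pig
users  = LOAD 'users.txt' USING PigStorage('\t') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;      -- relational-style operation
names  = FOREACH adults GENERATE name;   -- projection
STORE names INTO 'adult_names';
```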

A) Ease of programming: – It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

B) Optimization opportunities: – The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

C) Extensibility: – Users can create their own functions to do special-purpose processing.
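As a sketch of extensibility, a user-defined function packaged in a jar can be registered and invoked directly from Pig Latin (the jar and the UPPER class are assumptions, not part of Pig itself):

```pig
-- myudfs.jar is assumed to contain a Java UDF class named UPPER.
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);  -- call the custom Java function per record
DUMP B;
```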

Grunt

Grunt is Pig’s interactive shell. To run the Grunt shell in local mode, use the following steps.

First, point $PIG_CLASSPATH to the pig.jar file (in your current working directory):

$ export PIG_CLASSPATH=./pig.jar

From your current working directory, run:

$ pig -x local

The Grunt shell is invoked and you can enter commands at the prompt.

grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
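The same pipeline can write its output back to the file system instead of the console, for example:

```pig
grunt> A = load 'passwd' using PigStorage(':');  -- split each line on ':'
grunt> B = foreach A generate $0 as id;          -- keep the first field (the user name)
grunt> store B into 'id.out';                    -- write the results to the directory id.out
```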

Example:

data = LOAD 'employee.csv' USING PigStorage(',') AS
    (first_name:chararray, last_name:chararray, age:int,
     wage:float, department:chararray);
grouped_by_department = GROUP data BY department;
total_wage_by_department = FOREACH grouped_by_department
    GENERATE
        group AS department,
        COUNT(data) AS employee_count,
        SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_ordered = LIMIT total_ordered 10;

books = LOAD 'books.csv.bz2' USING PigStorage(',') AS
    (book_id:int, book_name:chararray, author_name:chararray);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS
    (book_id:int, price:float, country:chararray);
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
total_sales_by_book = FOREACH grouped_by_book
    GENERATE
        group AS book,
        COUNT(data) AS sales_volume,
        SUM(data.book_sales::price) AS total_sales;

Pig is mainly used to improve programming productivity. Working directly with the Java APIs can be tedious and error prone, and it also restricts the use of Hadoop to Java programmers. Pig solves this problem: it is a programming language that simplifies the common tasks of working with Hadoop, such as loading data, expressing transformations on the data, and storing the final results. Pig’s operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

Pig allows the development of compact scripts for transforming data flows, for incorporation into larger applications. Pig is a fairly thin layer; its main advantage is that it drastically cuts the amount of code needed compared to direct use of Hadoop’s Java APIs. Pig’s intended audience remains primarily the software developer.
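The load–transform–store pattern takes only a few lines in Pig Latin; the file names and the space delimiter here are assumptions:

```pig
logs   = LOAD 'access_log' USING PigStorage(' ') AS (ip:chararray, url:chararray);
by_url = GROUP logs BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hit_count;
STORE hits INTO 'url_hits';
```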

Apache Pig Advantages

  1. Pig makes Hadoop accessible
  2. Pig chains together complex job flows
  3. User-defined functions are first-class citizens
  4. Vibrant open-source community, with a dedicated team at Yahoo! improving it daily
  5. At Twitter, Pig helps teams understand their business faster