Understanding Pig scripting

#hadoop #pig #pigfundamentals

Prajakta kulkarni Oct 30 2020 · 1 min read

What is Pig?

Pig is an open-source, high-level dataflow system for Hadoop. It provides a simple language, Pig Latin, for queries and data manipulation.

It is well suited to ad hoc processing and analysis of very large datasets.

Why Pig when I already have MapReduce (MR)?

MapReduce programs often need multiple stages, which leads to long development cycles. In MR, users have to reimplement common functionality (join, filter, etc.) themselves.

Pig provides these as built-in operators.

Writing the same logic in Pig takes much less time than writing it in MapReduce.
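To give a sense of the difference in code size, here is a minimal word-count sketch in Pig Latin. The input path 'input.txt' and all alias names are assumptions made for illustration; they do not come from a real dataset.

```pig
-- Hypothetical input file: each line of text becomes one record.
lines = LOAD 'input.txt' AS (line:chararray);

-- TOKENIZE splits a line into a bag of words; FLATTEN turns that bag into rows.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together, then count the rows in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Print the (word, n) pairs to the console.
DUMP counts;
```

The same job written directly against the MapReduce API would typically need a mapper class, a reducer class and driver code.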

SQL vs Pig

Pig Latin is a dataflow language, in which the script spells out each processing step; SQL is declarative, in which the query states the desired result.

Pig reads data directly from its original sources during execution, whereas SQL needs the data to be loaded into database tables first.

In Pig a schema is optional and can be declared when the data is loaded, whereas SQL requires a strict schema.

Pig works with structured and semi-structured data, whereas SQL works with the structured tables of a relational database management system.

SQL example:
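Pig does not require a schema up front, which the following sketch tries to show. It loads the same hypothetical 'users' file once without a schema and once with one; the field types and the age threshold are assumptions chosen for illustration.

```pig
-- No schema: fields are addressed by position ($0, $1, ...) and default to bytearray.
raw_users = LOAD 'users';
first_field = FOREACH raw_users GENERATE $0;

-- Schema declared at load time: fields get names and types.
typed_users = LOAD 'users' AS (name:chararray, age:int, ipaddress:chararray);
adults = FILTER typed_users BY age >= 18;

-- DESCRIBE prints the schema Pig has attached to a relation.
DESCRIBE typed_users;
```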

select name, ipaddress
from users join clicks on (users.name = clicks.user)
where value > 0;

Pig example:

users = LOAD 'users' AS (name, age, ipaddress);
clicks = LOAD 'clicks' AS (user, url, value);
valuable_clicks = FILTER clicks BY value > 0;
userclicks = JOIN users BY name, valuable_clicks BY user;
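The SQL query returns only name and ipaddress, so a complete Pig version needs one more projection step after the join. The sketch below is a fuller variant of the example under some assumptions that are not in the original: comma-delimited input files, explicit field types, and a hypothetical output directory 'userclicks_out'.

```pig
-- Assumed: both inputs are comma-delimited text files.
users = LOAD 'users' USING PigStorage(',')
        AS (name:chararray, age:int, ipaddress:chararray);
clicks = LOAD 'clicks' USING PigStorage(',')
         AS (user:chararray, url:chararray, value:int);

-- Keep only the clicks with a positive value (the SQL "where" clause).
valuable_clicks = FILTER clicks BY value > 0;

-- Join on users.name = clicks.user.
userclicks = JOIN users BY name, valuable_clicks BY user;

-- After a join, fields can be qualified with relation::field.
result = FOREACH userclicks GENERATE users::name, users::ipaddress;

-- Hypothetical output directory.
STORE result INTO 'userclicks_out' USING PigStorage(',');
```

Each statement names an intermediate relation, which is what makes the script read as a step-by-step dataflow rather than a single declarative query.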

Components of Pig:

Pig Latin program

It is made up of a series of operations or transformations that are applied to the input data to produce output.

Pig turns these transformations into a series of MapReduce jobs.
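One way to look at this translation is the EXPLAIN operator, which prints the plans Pig builds for a relation, including the MapReduce plan. The sketch below assumes a hypothetical 'clicks' file with the fields shown.

```pig
-- Hypothetical input with an integer value per click.
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, value:int);

-- Group by user and sum the values: a typical map-then-reduce shape.
by_user = GROUP clicks BY user;
totals = FOREACH by_user GENERATE group AS user, SUM(clicks.value) AS total;

-- Show the logical, physical and MapReduce plans for totals
-- without actually running the job.
EXPLAIN totals;
```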

Two ways of executing a Pig script

Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec (execute).

Script: Pig can run a script file that contains Pig commands, e.g. pig xyz.pig runs the commands in the local file xyz.pig.
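As a small sketch of the Grunt option, the lines below would be typed at the grunt> prompt rather than placed in a Pig Latin script. xyz.pig is the script file named above.

```pig
-- exec runs the script in its own context: aliases defined inside
-- the script are not visible in the shell afterwards.
exec xyz.pig

-- run executes the script as if its lines had been typed into the shell,
-- so its aliases stay available for further commands.
run xyz.pig
```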

Pig data types

Simple types: int, long, float, double, chararray, bytearray

Pig complex data types:

Tuple, bag, map
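As a sketch of how simple types and the complex types (tuple, bag, map) can appear together, the schema below describes a hypothetical 'students' file. Every field name here is an assumption made for illustration.

```pig
-- Hypothetical input whose fields use Pig's text notation:
-- (..) for a tuple, {..} for a bag, [key#value] for a map.
students = LOAD 'students' AS (
    name:chararray,
    address:tuple(city:chararray, zip:chararray),
    scores:bag{t:tuple(subject:chararray, mark:int)},
    extra:map[]
);

-- A dot reaches into a tuple; # looks up a key in a map.
contact = FOREACH students GENERATE name, address.city, extra#'email';

-- FLATTEN unnests the bag into one row per (subject, mark) pair.
marks = FOREACH students GENERATE name, FLATTEN(scores);
```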

Pig operations:

I/O operations: load, dump, store

Filtering: filter, distinct, foreach...generate, limit

Join/grouping: join, cross, group, cogroup

Sorting: order

Union/split: union, split

Eval functions: min, max, sum, count, tokenize

Misc: flatten, sample
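Several of these operators can be chained in one short script. The sketch below is only an illustration: the 'clicks' file, the threshold of 10, the sample fraction and the output directory 'clicks_out' are all assumptions.

```pig
-- Hypothetical input.
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, value:int);

-- Remove duplicate records.
uniq = DISTINCT clicks;

-- Route records into two relations based on a condition.
SPLIT uniq INTO small_clicks IF value < 10, large_clicks IF value >= 10;

-- Sort the large clicks and keep the five highest values.
sorted = ORDER large_clicks BY value DESC;
top5 = LIMIT sorted 5;

-- Keep roughly 10% of the small clicks, chosen at random.
sampled = SAMPLE small_clicks 0.1;

-- Combine both relations and write the result out.
combined = UNION top5, sampled;
STORE combined INTO 'clicks_out';
```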
