What is Pig?
Pig is an open-source, high-level dataflow system. It provides a simple language for queries and data manipulation: Pig Latin.
It is well suited to ad-hoc processing and analysis of very large data sets.
Why Pig when I have MapReduce (MR)?
MapReduce requires multiple stages, leading to long development cycles. In MR, users have to reinvent common functionality (join, filter, etc.).
Pig provides these as built-in operators.
Writing the same logic in Pig takes much less time than writing it in MapReduce.
SQL vs Pig
Pig Latin is a dataflow language (the script states the steps in order), whereas SQL is declarative (the query states the desired result).
Pig reads data directly from its original source at execution time, whereas SQL needs the data to be loaded physically into database tables first.
In Pig a schema is optional and can be declared when the data is loaded, whereas SQL requires a strict, predefined schema.
Pig works on structured and semi-structured data, whereas SQL works on structured data held in a relational database management system.
The query in SQL:
SELECT name, ipaddress
FROM users JOIN clicks ON (users.name = clicks.user)
WHERE value > 0;
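The same dataflow in Pig Latin:
-- load both inputs and name their fields
users = LOAD 'users' AS (name, age, ipaddress);
clicks = LOAD 'clicks' AS (user, url, value);
-- keep only clicks with a positive value, then join them to users
valuable_clicks = FILTER clicks BY value > 0;
user_clicks = JOIN users BY name, valuable_clicks BY user;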
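The SQL query also selects only name and ipaddress, which in Pig Latin takes one more step. A minimal sketch of the full script follows. The field types and the tab-delimited input files are assumptions, not something the notes specify.

```pig
-- Assumption: 'users' and 'clicks' are tab-delimited files (the PigStorage default).
users = LOAD 'users' AS (name:chararray, age:int, ipaddress:chararray);
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, value:int);

-- Keep only the clicks with a positive value.
valuable_clicks = FILTER clicks BY value > 0;

-- Join the two relations on the user name.
user_clicks = JOIN users BY name, valuable_clicks BY user;

-- Projection: the counterpart of SQL's "select name, ipaddress".
-- After a JOIN, fields are addressed as relation::field.
result = FOREACH user_clicks GENERATE users::name AS name, users::ipaddress AS ipaddress;

-- Print the result to the console.
DUMP result;
```

Each statement names an intermediate relation, so the script reads as a sequence of steps rather than as one nested query.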
Components of Pig:
A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
Pig turns these transformations into a series of MapReduce jobs.
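One way to see this translation is the EXPLAIN operator, which prints the plans Pig builds for a relation. The sketch below assumes a hypothetical tab-delimited file 'clicks' with three fields. The exact plan output depends on the Pig version and execution mode.

```pig
-- Hypothetical input: tab-delimited file 'clicks'.
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, value:int);

-- Two transformations: group the clicks by user, then count each group.
by_user = GROUP clicks BY user;
counts = FOREACH by_user GENERATE group AS user, COUNT(clicks) AS num_clicks;

-- Print the logical, physical and MapReduce plans without running the job.
EXPLAIN counts;
```

In MapReduce mode, the GROUP step is typically where the plan switches from the map phase to the reduce phase.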
Two ways of executing a Pig script
Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using the run and exec (execute) commands.
Script: Pig can run a script file that contains Pig commands, e.g. pig xyz.pig runs the commands in the local file xyz.pig.
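As a sketch of both routes, here is what a small xyz.pig could contain and how it might be started. The script contents and the input file 'clicks' are hypothetical. Only the file name comes from the notes.

```pig
-- xyz.pig (hypothetical contents)
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, value:int);
first_five = LIMIT clicks 5;
DUMP first_five;

-- From the operating-system shell (local mode, no cluster needed):
--     pig -x local xyz.pig
-- From inside Grunt:
--     grunt> exec xyz.pig
--     grunt> run xyz.pig
```

The two Grunt commands differ in scope: exec runs the script in its own context, so its aliases are not visible in Grunt afterwards, while run executes it as if it had been typed into the shell, so aliases such as first_five remain available.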
Pig data types
Pig complex data types:
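Pig has three complex types: tuple (an ordered set of fields), bag (a collection of tuples) and map (a set of key/value pairs). They can be combined with simple types such as int and chararray in a schema. The sketch below uses a hypothetical 'students' file and hypothetical field names to show how each one is declared and accessed.

```pig
-- Hypothetical input 'students'; every field name here is illustrative.
students = LOAD 'students' AS (
    name:chararray,
    address:tuple(city:chararray, zip:chararray),
    courses:bag{c:tuple(title:chararray)},
    attrs:map[]
);

-- Tuple fields are reached with a dot, map values with #'key'.
contacts = FOREACH students GENERATE name, address.city, attrs#'email';

-- FLATTEN un-nests a bag: one output row per (name, course title).
per_course = FOREACH students GENERATE name, FLATTEN(courses);

-- Print the schema Pig has recorded for the relation.
DESCRIBE students;
```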
I/O operations: LOAD, DUMP, STORE
Filtering: FILTER, DISTINCT, FOREACH...GENERATE, LIMIT
Eval functions: MIN, MAX, SUM, COUNT, TOKENIZE
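Most of these operators and functions fit into one short word-count style script. This is a sketch only: the file names are hypothetical, and GROUP and SIZE are used in addition to the items listed above because the aggregate functions work on grouped data.

```pig
-- LOAD: read the input (hypothetical file 'lines.txt', one line of text per record).
text = LOAD 'lines.txt' AS (line:chararray);

-- FOREACH ... GENERATE with TOKENIZE: split each line into words, one word per row.
words = FOREACH text GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- FILTER: keep only words longer than three characters.
long_words = FILTER words BY SIZE(word) > 3;

-- DISTINCT: remove duplicate words.
unique_words = DISTINCT long_words;

-- COUNT: number of occurrences of each word.
grouped = GROUP long_words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(long_words) AS occurrences;

-- MIN, MAX, SUM: aggregate over all the counts at once.
everything = GROUP counts ALL;
summary = FOREACH everything GENERATE MIN(counts.occurrences),
                                      MAX(counts.occurrences),
                                      SUM(counts.occurrences);

-- LIMIT and DUMP: show at most ten rows on the console.
sample_rows = LIMIT counts 10;
DUMP sample_rows;
DUMP unique_words;
DUMP summary;

-- STORE: write the full result to an output directory (it must not already exist).
STORE counts INTO 'word_counts';
```

Operator keywords such as LOAD and FILTER are not case sensitive, but the names of built-in functions such as COUNT and TOKENIZE are, so they are written in upper case.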