Pig Scripting in Big Data-Big Data - (PART - 5)

Sep 17, 2019 Big Data, Pig, September Series, 4338 Views

In this article, we will discuss Pig

Pig

Pig is a scripting platform designed to process and analyze large datasets, and it runs on Hadoop clusters. Pig is extensible, easily programmed, and self-optimizing.

Why Pig?

Before 2006, programs were written only on MapReduce using Java. Developers used to face a lot of challenges like:

Code difficulty
Rigid Dataflow
Need for common operation
Fundamentals while creating a program

Pig was developed to overcome these challenges.

Features of Pig:

Supports UDFs and data types.
Provide step-by-step procedural control
Schemas can be assigned dynamically.

Pig Architecture

Parser - It does the type checking and checks the syntax of the script.
Optimizer - It performs activities like split, merge, transform, recorder, etc.
Compiler - It compiles the optimized code into a series of MapReduce jobs.
Execution Engine - It executes the MapReduce jobs on Hadoop to produce the desired results.

Stages of Pig Operations

Load data and write Pig script:

A= Load 'mytxt' AS(x,y,z);

B= Filter A by x>0;

Store B INTO 'output';

Pig Operations:

1. Parses and check the script

2. Optimizes the script

3. Plans execution

4. Submits to Hadoop

5. Monitors job progress

Execution of the plan:

Results are stored in HDFS or dumped on screen.

Data Model Supported by Pig

Atom - A simple atomic value. Example: 'Steve'.
Tuple - A sequence of fields that can be of any data type. Example: ('Steve', 142)
Bag - A collection of tuples of potentially varying structures that can contain duplicates. Example: {('Steve'),('Roger',(45,78))}
Map - An associative array- the key must be a chararray, but the value can be of any type. Example: [name#Steve, age#32]

Pig Execution Modes

Local Mode - Pig depends on the OS file system.
MapReduce Mode - Pig depends on the HDFS.

Conclusion

Pig is a high-level data flow scripting language and it has two major components: Runtime engine and Pig Latin language.
Pig runs in two execution modes: MapReduce and Local.
All the three parameters need to be followed before setting the environment for Pig Latin: ensure that all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded in the HDFS.

Pig Scripting in Big Data-Big Data - (PART - 5)

Pig

Why Pig?

Features of Pig:

Pig Architecture

Stages of Pig Operations

Data Model Supported by Pig

Pig Execution Modes

Conclusion

Related Article

Advertisement

COMPANY

CONTRIBUTE

Related Article

Run Word Count Java Mapreduce Program in Hadoop

Implementation of basic Hadoop commands

Evaluating execution time for multiplication of various multi-dimensional matrix in Hadoop.

A Comparative of Traditional RDBMS and HiveQL in Hadoop Enviromnent

Introduction to NoSQL Database

Pig Scripting in Big Data-Big Data - (PART - 5)

Pig

Why Pig?

Features of Pig:

Pig Architecture

Stages of Pig Operations

Data Model Supported by Pig

Pig Execution Modes

Conclusion

Related Article

Advertisement

COMPANY

JOIN TUTORIALS LINK

Our Newsletter Will Let You Know When Any NewArticles, Tutorials and Video Are Released.

CONTRIBUTE

Follow us

Our Newsletter Will Let You Know When Any New
Articles, Tutorials and Video Are Released.