Loading, please wait...

A to Z Full Forms and Acronyms

Pig Scripting in Big Data-Big Data - (PART - 5)

Sep 17, 2019 Big Data, Pig, September Series, 2861 Views
In this article, we will discuss Pig

Pig

Pig is a scripting platform designed to process and analyze large datasets, and it runs on Hadoop clusters. Pig is extensible, easily programmed, and self-optimizing.

Why Pig?

Before 2006, programs were written only on MapReduce using Java. Developers used to face a lot of challenges like:

  • Code difficulty
  • Rigid Dataflow
  • Need for common operation
  • Fundamentals while creating a program

Pig was developed to overcome these challenges.

Features of Pig:

  • Supports UDFs and data types.
  • Provide step-by-step procedural control
  • Schemas can be assigned dynamically.

Pig Architecture

 

 

 

  • Parser - It does the type checking and checks the syntax of the script.
  • Optimizer - It performs activities like split, merge, transform, recorder, etc.
  • Compiler - It compiles the optimized code into a series of MapReduce jobs.
  • Execution Engine - It executes the MapReduce jobs on Hadoop to produce the desired results.

Stages of Pig Operations

  • Load data and write Pig script:

        A= Load 'mytxt' AS(x,y,z);

        B= Filter A by x>0;

        Store B INTO 'output';

  • Pig Operations:

        1. Parses and check the script

        2. Optimizes the script

        3. Plans execution

        4. Submits to Hadoop

        5. Monitors job progress

  • Execution of the plan:

        Results are stored in HDFS or dumped on screen.

Data Model Supported by Pig

  • Atom - A simple atomic value. Example: 'Steve'.
  • Tuple - A sequence of fields that can be of any data type. Example: ('Steve', 142)
  • Bag - A collection of tuples of potentially varying structures that can contain duplicates. Example: {('Steve'),('Roger',(45,78))}
  • Map - An associative array- the key must be a chararray, but the value can be of any type. Example: [name#Steve, age#32]

Pig Execution Modes

  • Local Mode - Pig depends on the OS file system.
  • MapReduce Mode - Pig depends on the HDFS.

Conclusion

  • Pig is a high-level data flow scripting language and it has two major components: Runtime engine and Pig Latin language.
  • Pig runs in two execution modes: MapReduce and Local.
  • All the three parameters need to be followed before setting the environment for Pig Latin: ensure that all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded in the HDFS.
A to Z Full Forms and Acronyms

Related Article