Lesson 3: Apache Pig

Learn the basics of Apache Pig.

Concepts

Concept 3.1: Introductory Pig script

PART 1: CONCEPTUAL DISCUSSION

Introduction to Pig Scripting

Apache Pig is a platform for analyzing large datasets on a Hadoop cluster. Pig scripts are written in Pig Latin, a data-flow language in which each statement takes a relation (a collection of rows) and produces a new one. Pig compiles these statements into distributed jobs, so the same script works on a small sample and on a large dataset. In this introductory lesson, we will focus on a simple Pig script that demonstrates how to load data, define columns, filter rows, print the result, and store it.

Loading Data

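A Pig script usually starts by loading data from a source with the LOAD statement. In this example, we load data from a folder using PigStorage with a comma delimiter, so every file in the folder is read as text with one record per line and columns separated by commas. PigStorage splits on tabs by default, which is why the comma is passed explicitly. It splits on every comma it finds and does not interpret quoted fields, so the input should be plain comma-separated text.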
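As a minimal sketch of the loading step, the statements below load the same placeholder path with and without a schema. The relation names raw and typed are hypothetical and chosen only for illustration; 'folder' is the placeholder path used in this lesson's code sample.

```pig
-- Without an AS clause the fields have no names and are read as bytearray.
raw = LOAD 'folder' USING PigStorage(',');

-- With an AS clause each field gets a name and a type.
typed = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);

-- DESCRIBE prints the schema of a relation, which is a quick way to check
-- that the names and types are what you expect.
DESCRIBE typed;
```

Declaring the schema at load time lets later statements refer to columns by name instead of by position.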

Defining Columns

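As part of loading, we name and type the columns that we want to work with. In this script, the AS clause of the LOAD statement declares three columns (col1 as chararray, col2 as int, and col3 as float), and a FOREACH ... GENERATE statement then projects those three columns into a new relation.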
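The projection step can be sketched in a few variants. The relation names by_name, by_position, and renamed, and the aliases label and amount, are hypothetical; only col1, col2, and col3 come from this lesson.

```pig
data = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);

-- Select columns by the names declared in the AS clause.
by_name = FOREACH data GENERATE col1, col2, col3;

-- The same columns can be selected by position; $0 is the first field.
by_position = FOREACH data GENERATE $0, $1, $2;

-- A projection can also keep fewer columns and rename them.
renamed = FOREACH data GENERATE col1 AS label, col2 AS amount;
```

Because the schema already names all three columns, projecting all of them is optional; FOREACH ... GENERATE becomes more useful when you want to drop, rename, or compute columns.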

Filtering Data

Next, we will filter the data on a numerical column, keeping only the rows whose value in that column meets a condition. In this script, the condition is col2 > 10.
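A short sketch of filtering follows. The first statement uses the condition from this lesson; the combined condition and its threshold of 100.0 are assumptions added only to show the syntax.

```pig
data = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);

-- Keep rows where col2 is greater than 10 (the condition used in this lesson).
over_ten = FILTER data BY col2 > 10;

-- Conditions can be combined with AND, OR, and NOT.
-- The 100.0f threshold is a hypothetical value for illustration.
in_range = FILTER data BY (col2 > 10) AND (col3 <= 100.0f);

-- A value that cannot be read as an int is loaded as null, and a comparison
-- with null does not pass the filter. Nulls can also be tested explicitly.
with_value = FILTER data BY col2 IS NOT NULL;
```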

Dumping Data

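After filtering the data, we will use the DUMP operator to print the results to the console, one row per line. DUMP triggers execution of the statements it depends on, so it lets us inspect the filtered data before proceeding to the next step. It is best suited to small results, because every row is written to the console.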
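When the filtered relation may be large, one common approach is to dump only a few rows. The sketch below assumes a preview of 5 rows, an arbitrary number; the relation name preview is hypothetical.

```pig
data = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);
filtered_data = FILTER data BY col2 > 10;

-- Keep at most 5 rows so the console output stays readable.
preview = LIMIT filtered_data 5;

-- DUMP prints each row as a tuple in parentheses with comma-separated fields.
DUMP preview;
```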

Storing Data

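Finally, we will store the filtered data with the STORE statement, using PigStorage with a pipe delimiter. Each row is written as a line of text with columns separated by pipes (|). The output location is a folder that Pig creates, and it must not already exist when the script runs, otherwise the job fails.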
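The storing step might look like the sketch below. The sample row in the comment is made up to show the effect of the delimiter, and the reloaded relation is a hypothetical example of how a later script could read the result back.

```pig
data = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);
filtered_data = FILTER data BY col2 > 10;

-- Write each row as one line of text with '|' between fields.
-- An illustrative row such as (abc,12,1.5) would be written as: abc|12|1.5
STORE filtered_data INTO 'output_folder' USING PigStorage('|');

-- A later script can read the result back by using the same delimiter.
reloaded = LOAD 'output_folder' USING PigStorage('|') AS (col1:chararray, col2:int, col3:float);
```

Choosing a delimiter that never appears inside the data itself avoids rows being split into the wrong number of columns when they are read back.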

PART 2: CODE SAMPLE

Code Sample

-- Load comma-delimited data from a folder and declare the name and type of each column
data = LOAD 'folder' USING PigStorage(',') AS (col1:chararray, col2:int, col3:float);

-- Project the three columns declared in the AS clause above
columns = FOREACH data GENERATE col1, col2, col3;

-- Filter data by numerical column
filtered_data = FILTER columns BY col2 > 10;

-- Dump filtered data
DUMP filtered_data;

-- Store filtered data
STORE filtered_data INTO 'output_folder' USING PigStorage('|');