Apache Hadoop – A course for undergraduates Lecture 10
© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Multi-Dataset Operations with Pig (Chapter 10.1)
Multi-Dataset Operations with Pig
§ How to use grouping to combine data from multiple sources
§ Types of join operations in Pig and how to use them
§ Concatenating records to produce a single data set
§ Splitting a single data set into multiple relations
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Overview of Combining Data Sets
§ So far, we have concentrated on processing single data sets
– Valuable insight often results from combining multiple data sets
§ Pig offers several techniques for achieving this
– Using the GROUP operator with multiple relations
– Joining the data as you would in SQL
– Performing set operations like CROSS and UNION
§ We will cover each of these in this chapter
Example Data Sets (1)
§ Most examples in this chapter will involve the same two data sets
§ The first is a file containing information about Dualcore's stores
Stores
A  Anchorage
B  Boston
C  Chicago
D  Dallas
E  Edmonton
F  Fargo

§ There are two fields in this relation
1. store_id:chararray (unique key)
2. name:chararray (name of the city in which the store is located)
Example Data Sets (2)
§ Our other data set is a file containing information about Dualcore's salespeople
§ This relation contains three fields
1. person_id:int (unique key)
2. name:chararray (salesperson name)
3. store_id:chararray (refers to store)

Salespeople
1   Alice     B
2   Bob       D
3   Carlos    F
4   Dieter    A
5   Étienne   F
6   Fredo     C
7   George    D
8   Hannah    B
9   Irina     C
10  Jack
Grouping Multiple Relations
§ We previously learned about the GROUP operator
– Groups values in a relation based on the specified field(s)
§ The GROUP operator can also group multiple relations
– In this case, using the synonymous COGROUP operator is preferred

grouped = COGROUP stores BY store_id, salespeople BY store_id;
§ This collects values from both data sets into a new relation
– As before, the new relation is keyed by a field named group
– This group field (here, the store_id) is associated with one bag for each input:

(group, {bag of records from stores}, {bag of records from salespeople})
Example of COGROUP
§ Grouping the stores and salespeople relations shown earlier:

grunt> grouped = COGROUP stores BY store_id, salespeople BY store_id;
grunt> DUMP grouped;
(A,{(A,Anchorage)},{(4,Dieter,A)})
(B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
(C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
(D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
(E,{(E,Edmonton)},{})
(F,{(F,Fargo)},{(3,Carlos,F),(5,Étienne,F)})
(,{},{(10,Jack,)})
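The nested structure that COGROUP builds can be sketched in plain Python. This is an illustration of the semantics, not Pig's actual implementation; the helper name `cogroup` and the dict-of-bags layout are assumptions, and the sample data is a subset of the slides' relations:

```python
from collections import defaultdict

stores = [("A", "Anchorage"), ("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (8, "Hannah", "B"), (10, "Jack", None)]

def cogroup(left, right, left_key, right_key):
    """Collect records from both inputs into one bag per input, per key."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[left_key]][0].append(rec)
    for rec in right:
        groups[rec[right_key]][1].append(rec)
    # Each entry mirrors (group, {bag from stores}, {bag from salespeople})
    return dict(groups)

grouped = cogroup(stores, salespeople, 0, 2)
# grouped["E"] holds Edmonton's record plus an empty bag, just as the
# (E,{(E,Edmonton)},{}) tuple above has no salespeople
```

Note how keys present in only one input (Edmonton, and Jack's null store_id) still get a group, with an empty bag for the missing side.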
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Join Overview
§ The COGROUP operator creates a nested data structure
§ Pig Latin's JOIN operator creates a flat data structure
– Similar to joins in a relational database
§ A JOIN is similar to doing a COGROUP followed by a FLATTEN
– Though they handle null values differently
Key Fields
§ Like COGROUP, joins rely on a field shared by each relation

joined = JOIN stores BY store_id, salespeople BY store_id;

§ Joins can also use multiple fields as the key

joined = JOIN customers BY (name, phone_number),
              accounts BY (name, phone_number);
Inner Joins
§ The default JOIN in Pig Latin is an inner join

joined = JOIN stores BY store_id, salespeople BY store_id;

§ An inner join outputs records only when the key is found in all inputs
– In the above example, stores that have at least one salesperson
§ You can do an inner join on multiple relations in a single statement
– But you must use the same key to join them
Inner Join Example
§ Joining the stores and salespeople relations:

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
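The flat output of an inner join can be modeled as one concatenated record per matching pair. This Python sketch (the `inner_join` helper is hypothetical, and the data is a subset of the slide relations) shows why Edmonton and Jack disappear:

```python
stores = [("A", "Anchorage"), ("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (4, "Dieter", "A"), (10, "Jack", None)]

def inner_join(left, right, left_key, right_key):
    """One flat record per matching pair of keys; unmatched records
    (Edmonton here, and Jack with his null store_id) are dropped."""
    return [l + r
            for l in left
            for r in right
            if l[left_key] is not None and l[left_key] == r[right_key]]

joined = inner_join(stores, salespeople, 0, 2)
# → [("A", "Anchorage", 4, "Dieter", "A"), ("B", "Boston", 1, "Alice", "B")]
```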
Eliminating Duplicate Fields (1)
§ As with COGROUP, the new relation still contains duplicate fields

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
Eliminating Duplicate Fields (2)
§ We can use FOREACH...GENERATE to retain just the fields we need
– However, it is now slightly more complex to reference fields
– We must fully qualify any fields with names that are not unique

grunt> DESCRIBE joined;
joined: {stores::store_id: chararray,stores::name: chararray,
salespeople::person_id: int,salespeople::name: chararray,
salespeople::store_id: chararray}
grunt> cleaned = FOREACH joined GENERATE stores::store_id,
stores::name, person_id, salespeople::name;
grunt> DUMP cleaned;
(A,Anchorage,4,Dieter)
(B,Boston,1,Alice)
(B,Boston,8,Hannah)
... (additional records omitted for brevity) ...
Outer Joins
§ Pig Latin allows you to specify the type of join following the field name
– Inner joins do not specify a join type

joined = JOIN relation1 BY field [LEFT|RIGHT|FULL] OUTER,
              relation2 BY field;

§ An outer join does not require the key to be found in both inputs
§ Outer joins require Pig to know the schema for at least one relation
– Which relation requires a schema depends on the join type
– Full outer joins require schemas for both relations
Left Outer Join Example
§ Result contains all records from the relation specified on the left, but only matching records from the one specified on the right

grunt> joined = JOIN stores BY store_id LEFT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
Right Outer Join Example
§ Result contains all records from the relation specified on the right, but only matching records from the one specified on the left

grunt> joined = JOIN stores BY store_id RIGHT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)
Full Outer Join Example
§ Result contains all records from both relations, whether or not the key has a match in the other relation

grunt> joined = JOIN stores BY store_id FULL OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)
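The null padding that distinguishes the outer joins can be sketched the same way. The `full_outer_join` helper below is hypothetical and simplified: field counts are hard-coded for the two sample relations, and since the left keys here are all non-null it skips the rule that Pig never matches null keys against each other:

```python
stores = [("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (10, "Jack", None)]

def full_outer_join(left, right, left_key, right_key):
    """Match on key where possible; pad the missing side with nulls."""
    results = []
    matched_right = set()
    for l in left:
        hits = [r for r in right if r[right_key] == l[left_key]]
        matched_right.update(id(r) for r in hits)
        if hits:
            results.extend(l + r for r in hits)
        else:
            results.append(l + (None,) * 3)   # right side has 3 fields
    for r in right:
        if id(r) not in matched_right:
            results.append((None,) * 2 + r)   # left side has 2 fields
    return results

joined = full_outer_join(stores, salespeople, 0, 2)
# Boston matches Alice; Edmonton and Jack survive with null padding,
# like the (E,Edmonton,,,) and (,,10,Jack,) tuples above
```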
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Crossing Data Sets
§ JOIN finds records in one relation that match records in another
§ Pig's CROSS operator creates the cross product of both relations
– Combines all records in both relations regardless of matching
– In other words, all possible combinations of records

crossed = CROSS stores, salespeople;

§ Careful: this can generate huge amounts of data!
Cross Product Example
§ Generates every possible combination of records in the stores and salespeople relations
§ This example uses smaller inputs: three stores (A Anchorage, B Boston, D Dallas) and four salespeople (1 Alice B, 2 Bob D, 8 Hannah B, 10 Jack)

grunt> crossed = CROSS stores, salespeople;
grunt> DUMP crossed;
(A,Anchorage,1,Alice,B)
(A,Anchorage,2,Bob,D)
(A,Anchorage,8,Hannah,B)
(A,Anchorage,10,Jack,)
(B,Boston,1,Alice,B)
(B,Boston,2,Bob,D)
(B,Boston,8,Hannah,B)
(B,Boston,10,Jack,)
(D,Dallas,1,Alice,B)
(D,Dallas,2,Bob,D)
(D,Dallas,8,Hannah,B)
(D,Dallas,10,Jack,)
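The cross-product semantics map directly onto `itertools.product`, which also makes the size warning concrete: the output grows as the product of the input sizes. A small sketch using a subset of the slide data:

```python
from itertools import product

stores = [("A", "Anchorage"), ("B", "Boston"), ("D", "Dallas")]
salespeople = [(1, "Alice", "B"), (10, "Jack", None)]

# Every pairing of a store record with a salesperson record,
# whether or not the store_ids match
crossed = [s + p for s, p in product(stores, salespeople)]

assert len(crossed) == len(stores) * len(salespeople)  # 3 * 2 = 6
```

With two relations of a million records each, the same operation would emit a trillion records, which is why CROSS deserves the "careful" warning above.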
Concatenating Data Sets
§ We have explored several techniques for combining data sets
– They have had one thing in common: they combine horizontally
§ The UNION operator combines records vertically
– It adds data from its input relations into a single new relation
– Pig does not require these inputs to have the same schema
– It does not eliminate duplicate records, nor does it preserve order
§ This is helpful for incorporating new data into your processing

both = UNION june_items, july_items;
UNION Example
§ Concatenates all records from the june_items and july_items relations

june_items:        july_items:
(Adapter,549)      (Fax,17999)
(Battery,349)      (GPS,24999)
(Cable,799)        (HDTV,65999)
(DVD,1999)         (Ink,3999)
(HDTV,79999)

grunt> both = UNION june_items, july_items;
grunt> DUMP both;
(Fax,17999)
(GPS,24999)
(HDTV,65999)
(Ink,3999)
(Adapter,549)
(Battery,349)
(Cable,799)
(DVD,1999)
(HDTV,79999)
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Splitting Data Sets
§ You have learned several ways to combine data sets into a single relation
§ Sometimes you need to split a data set into multiple relations
– Server logs by date range
– Customer lists by region
– Product lists by vendor
§ Pig Latin supports this with the SPLIT operator

SPLIT relation INTO relationA IF expression1,
                    relationB IF expression2,
                    relationC IF expression3...;

– Expressions need not be mutually exclusive
SPLIT Example
§ Split customers into groups for a rewards program, based on lifetime value (ltv)

customers:
(Annette,9700) (Bruce,23500) (Charles,17800) (Dustin,21250) (Eva,8500)
(Felix,9300) (Glynn,27800) (Henry,8900) (Ian,43800) (Jeff,29100)
(Kai,34000) (Laura,7800) (Mirko,24200)

grunt> SPLIT customers INTO gold_program IF ltv >= 25000,
                            silver_program IF ltv >= 10000 AND ltv < 25000;
grunt> DUMP gold_program;
(Glynn,27800)
(Ian,43800)
(Jeff,29100)
(Kai,34000)
grunt> DUMP silver_program;
(Bruce,23500)
(Charles,17800)
(Dustin,21250)
(Mirko,24200)
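SPLIT behaves like evaluating every record against each condition independently, which a pair of Python filters over a subset of the slide's data makes explicit:

```python
# A subset of the customers relation shown above
customers = [("Annette", 9700), ("Bruce", 23500), ("Glynn", 27800),
             ("Ian", 43800), ("Laura", 7800)]

# Each record is tested against every condition; because the conditions
# need not be mutually exclusive, a record may land in several output
# relations, or in none (Annette and Laura match neither condition here)
gold_program = [c for c in customers if c[1] >= 25000]
silver_program = [c for c in customers if 10000 <= c[1] < 25000]
```

Records that satisfy no condition are simply discarded, which is why customers with a lifetime value under 10000 appear in neither DUMP above.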
Essential Points
§ You can use COGROUP to group multiple relations
– This creates a nested data structure
§ Pig supports common SQL join types
– Inner, left outer, right outer, and full outer
– You may need to fully qualify field names when using joined data
§ Pig's CROSS operator creates every possible combination of input data
– This can create huge amounts of data, so use it carefully!
§ You can use a UNION to concatenate data sets
§ In addition to combining data sets, Pig supports splitting them too
Bibliography
The following offer more information on topics discussed in this chapter
§ For Pig GROUP and COGROUP information:
– http://pig.apache.org/docs/r0.10.0/basic.html#GROUP
Extending Pig (Chapter 10.2)
Extending Pig
§ How to use parameters in your Pig Latin to increase its flexibility
§ How to define and invoke macros to improve the reusability of your code
§ How to call user-defined functions from your code
§ How to write user-defined functions in Python
§ How to process data with external scripts
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
The Need for Parameters (1)
§ Some processing is very repetitive
– For example, creating sales reports

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
bigsales_alice = FILTER bigsales BY name == 'Alice';
STORE bigsales_alice INTO 'Alice';
The Need for Parameters (2)
§ You may need to change the script slightly for each run
– For example, to modify the paths or filter criteria

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
bigsales_alice = FILTER bigsales BY name == 'Alice';
STORE bigsales_alice INTO 'Alice';
Making the Script More Flexible with Parameters
§ Instead of hardcoding values, Pig allows you to use parameters
– These are replaced with specified values at runtime

allsales = LOAD '$INPUT' AS (name, price);
bigsales = FILTER allsales BY price > $MINPRICE;
bigsales_name = FILTER bigsales BY name == '$NAME';
STORE bigsales_name INTO '$NAME';

– Then specify the values on the command line

$ pig -p INPUT=sales -p MINPRICE=999 \
      -p NAME='Jo Anne' reporter.pig
Two Tricks for Specifying Parameter Values
§ You can also specify parameter values in a text file
– An alternative to typing each one on the command line

INPUT=sales
MINPRICE=999
# comments look like this
NAME='Alice'

– Use the -m filename option to tell Pig which file contains the values
§ Parameter values can be defined with the output of a shell command
– For example, to set MONTH to the current month:

MONTH=`date +'%m'`  # returns 03 for March, 05 for May
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
The Need for Macros
§ Parameters simplify repetitive code by allowing you to pass in values
– But sometimes you would like to reuse the actual code too

allsales = LOAD 'sales' AS (name, price);
byperson = FILTER allsales BY name == 'Alice';
SPLIT byperson INTO low IF price < 1000,
                    high IF price >= 1000;
amt1 = FOREACH low GENERATE name, price * 0.07 AS amount;
amt2 = FOREACH high GENERATE name, price * 0.12 AS amount;
commissions = UNION amt1, amt2;
grpd = GROUP commissions BY name;
out = FOREACH grpd GENERATE SUM(commissions.amount) AS total;
Defining a Macro in Pig Latin
§ Macros allow you to define a block of code to reuse easily
– Similar (but not identical) to a function in a programming language

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);
    byperson = FILTER allsales BY name == '$NAME';
    SPLIT byperson INTO low IF price < $SPLIT_AMT,
                        high IF price >= $SPLIT_AMT;
    amt1 = FOREACH low GENERATE name, price * $LOW_PCT AS amount;
    amt2 = FOREACH high GENERATE name, price * $HIGH_PCT AS amount;
    commissions = UNION amt1, amt2;
    grouped = GROUP commissions BY name;
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};
Invoking Macros
§ To invoke a macro, call it by name and supply values in the correct order

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT) returns result {
    allsales = LOAD 'sales' AS (name, price);
    ... (other code removed for brevity) ...
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
carlos_comm = calc_commission('Carlos', 2000, 0.08, 0.14);
Reusing Code with Imports
§ After defining a macro, you may wish to use it in multiple scripts
§ You can include one script within another, starting with Pig 0.9
– This is done with the import keyword and the path to the file being imported

-- We saved the macro to a file named commission_calc.pig
import 'commission_calc.pig';
alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
User-Defined Functions (UDFs)
§ We have covered many of Pig's built-in functions already
§ It is also possible to define your own functions
– Pig allows writing UDFs in several languages

Language                    Supported in Pig Versions
Java                        All
Python                      0.8 and later
JavaScript (experimental)   0.9 and later
Ruby (experimental)         0.10 and later
Groovy (experimental)       0.11 and later
§ In the next few slides, you will see how to use UDFs in Java, and how to write and use UDFs in Python
Using UDFs Written in Java
§ UDFs are packaged into Java Archive (JAR) files
§ There are only two required steps for using them
– Register the JAR file(s) containing the UDF and its dependencies
– Invoke the UDF using the fully-qualified classname

REGISTER '/path/to/myudf.jar';
...
data = FOREACH allsales GENERATE com.example.MYFUNC(name);

§ You can optionally define an alias for the function

REGISTER '/path/to/myudf.jar';
DEFINE FOO com.example.MYFUNC;
...
data = FOREACH allsales GENERATE FOO(name);
Writing UDFs in Python (1)
§ Now we will see how to write a UDF in Python
§ The data we want to process has inconsistent phone number formats

Alice   (314) 555-1212
Bob     212.555.9753
Carlos  405-555-3912
David   (202) 555.8471

§ We will write a Python UDF that can consistently extract the area code
Writing UDFs in Python (2)
§ Our Python code is straightforward
§ The only unusual thing is the optional @outputSchema decorator
– This tells Pig what data type we are returning
– If not specified, Pig will assume bytearray

@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats
    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode
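Because the function body is ordinary Python, it can be checked outside Pig before registering it. This sketch drops the @outputSchema decorator, which exists only inside Pig's Jython environment, and exercises the four sample formats:

```python
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats
    if len(phone) == 12:    # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:  # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode

# The four inconsistent formats from the previous slide
assert get_area_code("(314) 555-1212") == "314"
assert get_area_code("212.555.9753") == "212"
assert get_area_code("405-555-3912") == "405"
assert get_area_code("(202) 555.8471") == "202"
assert get_area_code("555-1212") == "???"  # unknown format falls through
```

Testing the logic this way catches format-handling mistakes much faster than rerunning the full Pig job.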
Invoking Python UDFs from Pig Latin
§ Using this UDF from our Pig Latin is also easy
– We saved our Python code as phonenumber.py
– This Python file is in our current directory

REGISTER 'phonenumber.py' USING jython AS phoneudf;

names = LOAD 'names' AS (name:chararray, phone:chararray);
areacodes = FOREACH names GENERATE phoneudf.get_area_code(phone) AS ac;
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
Open Source UDFs
§ Pig ships with a set of community-contributed UDFs called Piggy Bank
§ Another popular package of UDFs, called DataFu, has been open-sourced by LinkedIn
Piggy Bank
§ Piggy Bank ships with Pig
– You will need to register the piggybank.jar file
– The location may vary depending on source and version
– In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
§ Some UDFs in Piggy Bank include (package names omitted for brevity)

Class Name      Description
ISOToUnix       Converts an ISO 8601 date/time format to UNIX format
UnixToISO       Converts a UNIX date/time format to ISO 8601 format
LENGTH          Returns the number of characters in the supplied string
HostExtractor   Returns the host name from a URL
DiffDate        Returns number of days between two dates
DataFu
§ DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
– You will need to register the DataFu JAR file
– In the VM, it is at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
§ Some UDFs in DataFu include (package names omitted for brevity)

Class Name             Description
Quantile               Calculates quantiles for a data set
Median                 Calculates the median for a data set
Sessionize             Groups data into sessions based on a specified time window
HaversineDistInMiles   Calculates distance in miles between two points, given latitude and longitude
Using a Contributed UDF
§ Here is an example of using a UDF from DataFu to calculate distance

Input data:
37.789336  -122.401385  40.707555  -74.011679

Pig Latin:
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
places = LOAD 'data' AS (lat1:double, lon1:double, lat2:double, lon2:double);
dist = FOREACH places GENERATE DIST(lat1, lon1, lat2, lon2);
DUMP dist;

Output data:
(2564.207116295711)
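The haversine formula behind that UDF is short enough to sketch directly. This is an illustration, not DataFu's source: the Earth radius of 3958.76 miles is an assumption, and DataFu's exact constant may differ slightly, so the result is only close to the value Pig printed:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    EARTH_RADIUS_MILES = 3958.76  # assumed mean radius
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)   # difference in latitude
    dlam = radians(lon2 - lon1)   # difference in longitude
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# The input coordinates from the example above
d = haversine_miles(37.789336, -122.401385, 40.707555, -74.011679)
assert 2500 < d < 2600  # in the neighborhood of the 2564.2 miles shown
```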
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
Processing Data with an External Script
§ While Pig Latin is powerful, some tasks are easier in another language
§ Pig allows you to stream data through another language for processing
– This is done using the STREAM keyword
§ Similar concept to Hadoop Streaming
– Data is supplied to the script on standard input as tab-delimited fields
– The script writes results to standard output as tab-delimited fields
STREAM Example in Python (1)
§ Our example will calculate a user's age given that user's birthdate
– This calculation is done in a Python script named agecalc.py
§ Here is the corresponding Pig Latin code
– Backticks are used to quote the script name following the alias
– Single quotes are used to quote the script name within SHIP
– The schema for the data produced by the script follows the AS keyword

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;
STREAM Example in Python (2)
§ Python code for agecalc.py

#!/usr/bin/env python
import sys
from datetime import datetime

for line in sys.stdin:
    line = line.strip()
    (name, birthdate) = line.split("\t")
    d1 = datetime.strptime(birthdate, '%Y-%m-%d')
    d2 = datetime.now()
    age = int((d2 - d1).days / 365)
    print "%s\t%i" % (name, age)
STREAM Example in Python (3)
§ The Pig script again, and the data it reads and writes

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

Input data:           Output data:
andy    1963-11-15    (andy,49)
betty   1985-12-30    (betty,27)
chuck   1979-02-23    (chuck,34)
debbie  1982-09-19    (debbie,30)
Essential Points
§ Pig supports several extension mechanisms
§ Parameters and macros can help make your code more reusable
– And easier to maintain and share with others
§ Piggy Bank and DataFu are two examples of open source UDFs
– You can also write your own UDFs
§ It is also possible to embed Pig within another language
Bibliography
The following offer more information on topics discussed in this chapter
§ Documentation on Parameter Substitution in Pig
– http://tiny.cloudera.com/dac07a
§ Documentation on Macros in Pig
– http://tiny.cloudera.com/dac07b
§ Documentation on User-Defined Functions in Pig
– http://tiny.cloudera.com/dac07c
§ Documentation on Piggy Bank
– http://tiny.cloudera.com/dac07d
§ Introducing DataFu
– http://tiny.cloudera.com/dac07e
Bibliography (cont'd)
The following offer more information on topics discussed in this chapter
§ Details about the other supported languages can be found in the latest documentation on the Pig web site:
– http://pig.apache.org/docs/r0.11.1/udf.html
§ For details on Python UDFs, see
– http://pig.apache.org/docs/r0.10.0/udf.html#python-udfs
Pig Troubleshooting and Optimization (Chapter 10.3)
Pig Troubleshooting and Optimization
§ How to control the information that Pig and Hadoop write to log files
§ How Hadoop's Web UI can help you troubleshoot failed jobs
§ How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
§ How Pig creates MapReduce jobs from your Pig Latin code
§ How several simple changes to your Pig Latin code can make it run faster
§ Which resources are especially helpful for troubleshooting Pig errors
Chapter Topics: Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop's Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Troubleshooting Overview
§ We have now covered how to use Pig for data analysis
– Unfortunately, sometimes your code may not work as you expect
– It is important to remember that Pig and Hadoop are intertwined
§ Here we will cover some techniques for isolating and resolving problems
– We will start with a few options to the pig command
Helping Yourself
§ We will discuss some options for the pig command in this chapter
– You can view all of them by using the -h (help) option
– Keep in mind that many options are advanced or rarely used
§ One useful option is -c (check), which validates the syntax of your code

$ pig -c myscript.pig
myscript.pig syntax OK

§ The -dryrun option is very helpful if you use parameters or macros

$ pig -p INPUT=demodata -dryrun myscript.pig

– Creates a myscript.pig.substituted file in the current directory
Getting Help from Others
§ Sometimes you may need help from others
– Mailing lists or newsgroups
– Forums and bulletin board sites
– Support services
§ You will probably need to provide the versions of Pig and Hadoop you are using

$ pig -version
Apache Pig version 0.10.0-cdh4.2.0
$ hadoop version
Hadoop 2.0.0-cdh4.2.0
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Customizing Log Messages
§ You may wish to change how much information is logged
– A recent change in Hadoop can cause lots of warnings when using Pig
§ Pig and Hadoop use the Log4J library, which is easily customized
§ Edit the /etc/pig/conf/log4j.properties file to include:

log4j.logger.org.apache.pig=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration=ERROR
§ Edit the /etc/pig/conf/pig.properties file to set this property:

log4jconf=/etc/pig/conf/log4j.properties
Customizing Log Messages on a Per-Job Basis
§ Often you just want to temporarily change the log level
– Especially while trying to troubleshoot a problem with your script
§ You can specify a Log4J properties file to use when you invoke Pig
– This overrides the default Log4J configuration
§ Create a customlog.properties file to include:

log4j.logger.org.apache.pig=DEBUG
§ Specify this file via the -log4jconf argument to Pig

$ pig -log4jconf customlog.properties
Controlling Client-Side Log Files
§ When a job fails, Pig may produce a log file to explain why
– These are typically produced in your current directory
§ To use a different location, use the -l (log) option when starting Pig

$ pig -l /tmp
§ Or set it permanently by editing /etc/pig/conf/pig.properties
– Specify a different directory using the log.file property

log.file=/tmp
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
The Hadoop Web UI
§ Each Hadoop daemon has a corresponding Web application
– This allows us to easily see cluster and job status with a browser
– In pseudo-distributed mode, the hostname is localhost
      Daemon Name       Address
HDFS  NameNode          http://hostname:50070/
      DataNode          http://hostname:50075/
MR1   JobTracker        http://hostname:50030/
      TaskTracker       http://hostname:50060/
MR2   ResourceManager   http://hostname:8088/
      NodeManager       http://hostname:8042/
The JobTracker Web UI (1)
§ The JobTracker offers the most useful of Hadoop’s Web UIs
– It displays MapReduce status information for the Hadoop cluster
The JobTracker Web UI (2)
§ The JobTracker Web UI also shows historical information
– You can click one of the links to see details for a particular job
The JobTracker Web UI (3) § The job detail page can help you troubleshoot a problem
Naming Your Job
§ Hadoop clusters are typically shared resources
– There might be dozens or hundreds of others using it
– As a result, sometimes it is hard to find your job in the Web UI
§ We recommend setting a name in your scripts to help identify your jobs
– Set the job.name property, either in Grunt or your script

grunt> set job.name 'Q2 2013 Sales Reporter';
Killing a Job
§ A job that processes a lot of data can take hours to complete
– Sometimes you spot an error in your code just after submitting a job
– Rather than wait for the job to complete, you can kill it
§ First, find the Job ID on the front page of the JobTracker Web UI
§ Then, use the kill command in Pig along with that Job ID

grunt> kill job_201303151454_0028
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Using SAMPLE to Create a Smaller Data Set
§ Your code might process terabytes of data in production
– However, it is convenient to test with smaller amounts during development
§ Use SAMPLE to choose a random set of records from a data set
§ This example selects about 5% of records from bigdata
– Stores them in a new directory called mysample

everything = LOAD 'bigdata';
subset = SAMPLE everything 0.05;
STORE subset INTO 'mysample';
Intelligent Sampling with ILLUSTRATE
§ Sometimes a random sample may lack data needed for testing
– For example, matching records in two data sets for a JOIN operation
§ Pig’s ILLUSTRATE keyword can do more intelligent sampling
– Pig will examine the code to determine what data is needed
– It picks a few records that properly exercise the code
§ You should specify a schema when using ILLUSTRATE
– Pig will generate records when yours don’t suffice
Using ILLUSTRATE Helps You to Understand Data Flow
§ Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
– The syntax is the same for all three

grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
grunt> bigsales = FILTER allsales BY price > 999;
grunt> ILLUSTRATE bigsales;
(Bob,3625)
---------------------------------------------
| allsales | name:chararray | price:int    |
---------------------------------------------
|          | Bob            | 3625         |
|          | Bob            | 998          |
---------------------------------------------
---------------------------------------------
| bigsales | name:chararray | price:int    |
---------------------------------------------
|          | Bob            | 3625         |
---------------------------------------------
General Debugging Strategies
§ Use DUMP, DESCRIBE, and ILLUSTRATE often
– The data might not be what you think it is
§ Look at a sample of the data
– Verify that it matches the fields in your LOAD specification
§ Other helpful steps for tracking down a problem
– Use -dryrun to see the script after parameters and macros are processed
– Test external scripts (STREAM) by passing some data from a local file
– Look at the logs, especially task logs available from the Web UI
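One way to combine these strategies (a sketch; the paths, file names, and sample rate are illustrative, not from the course): carve off a small sample of the production data, copy it locally, and iterate in Pig’s local mode, which runs against the local filesystem without submitting MapReduce jobs.

```pig
-- Grab roughly 1% of the production data for testing
raw    = LOAD 'bigdata';
subset = SAMPLE raw 0.01;
STORE subset INTO 'testdata';

-- Then iterate quickly against a local copy of the sample:
--   $ hadoop fs -get testdata .
--   $ pig -x local myscript.pig
```

Because local mode skips job submission entirely, each debugging cycle takes seconds rather than minutes.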
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Performance Overview
§ We have discussed several techniques for finding errors in Pig Latin code
– Once you get your code working, you’ll often want it to work faster
§ Performance tuning is a broad and complex subject
– It requires a deep understanding of Pig, Hadoop, Java, and Linux
– It is typically the domain of engineers and system administrators
§ Most of these topics are beyond the scope of this course
– We’ll cover the basics and offer several performance improvement tips
– See Programming Pig (chapters 7 and 8) for detailed coverage
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
How Pig Latin Becomes a MapReduce Job
§ Pig Latin code ultimately runs as MapReduce jobs on the Hadoop cluster
§ However, Pig does not translate your code into Java MapReduce
– Much like relational databases don’t translate SQL into C code
– Like a database, Pig interprets the Pig Latin to develop execution plans
– Pig’s execution engine uses these to submit MapReduce jobs to Hadoop
§ The EXPLAIN keyword details Pig’s three execution plans
– Logical
– Physical
– MapReduce
§ Seeing an example job will help us better understand EXPLAIN’s output
Description of Our Example Code and Data
§ Our goal is to produce a list of per-store sales

stores            sales
A  Anchorage      A  1999
B  Boston         D  2399
C  Chicago        A  4579
D  Dallas         B  6139
E  Edmonton       A  2489
F  Fargo          B  3699
                  E  2479
                  D  5799

grunt> stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales' AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group, SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group, stores BY store_id;
grunt> result = FOREACH joined GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)
Using the EXPLAIN Keyword
§ Using EXPLAIN rather than DUMP will show the execution plans

grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

grunt> EXPLAIN result;
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
result: (Name: LOStore Schema: stores::name#49:chararray,totals::amount#70:long)
|
|---result: (Name: LOForEach Schema: stores::name#49:chararray,totals::amount#70:long)

(other lines, including the physical and MapReduce plans, would follow)
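When the text plans grow long, EXPLAIN also accepts options to write its output elsewhere or in another format (a sketch; the output paths here are illustrative):

```pig
-- Write the plans to files under a directory instead of printing to the screen
grunt> EXPLAIN -out /tmp/plans result;

-- Emit the plans in DOT format, which Graphviz can render as a diagram
grunt> EXPLAIN -dot -out /tmp/dotplans result;
```

A rendered diagram can make it much easier to see how Pig grouped your operators into MapReduce jobs.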
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Pig’s Runtime Optimizations
§ Pig does not necessarily run your statements exactly as you wrote them
§ It may remove operations for efficiency

sales = LOAD 'sales' AS (store_id:chararray, price:int);
unused = FILTER sales BY price > 789;
DUMP sales;
§ It may also rearrange operations for efficiency

grouped = GROUP sales BY store_id;
totals = FOREACH grouped GENERATE group, SUM(sales.price);
joined = JOIN totals BY group, stores BY store_id;
only_a = FILTER joined BY store_id == 'A';
DUMP only_a;
Optimizations You Can Make in Your Pig Latin Code
§ Pig’s optimizer does what it can to improve performance
– But you know your own code and data better than it does
– A few small changes in your code can allow additional optimizations
§ On the next few slides, we will rewrite this Pig code for performance

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Don’t Produce Output You Don’t Really Need
§ In this case, we forgot to remove the DUMP statement
– This sometimes happens when moving from development to production
– And it might go unnoticed if you’re not watching the terminal

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;    -- remove this
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Specify Schema Whenever Possible
§ Specifying schema when loading data eliminates the need for Pig to guess
– It may choose a ‘bigger’ type than you need (e.g., long instead of int)
§ The postcode and phone fields in the stores data set were also never used
– Eliminating them in our schema ensures they’ll be omitted in joined

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Filter Unwanted Data As Early As Possible
§ We previously did our JOIN before our FILTER
– This produced lots of data we ultimately discarded
– Moving the FILTER operation up makes our script more efficient
– Caveat: We now have to filter by store ID rather than store name

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Consider Adjusting the Parallelization
§ Hadoop clusters scale by processing data in parallel
– Newer Pig releases choose the number of reducers based on input size
– However, it is often beneficial to set a value explicitly in your script
– Your system administrator can help you determine the best value

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Specify the Smaller Data Set First in a Join
§ We can optimize joins by specifying the larger data set last
– Pig will ‘stream’ the larger data set instead of reading it into memory
– In our case, we have far more records in sales than in stores
– Changing the order in the JOIN statement can boost performance

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Try Using Compression on Intermediate Data
§ Pig scripts often yield jobs with both a Map and a Reduce phase
– Remember that Mapper output becomes Reducer input
– Compressing this intermediate data is easy and can boost performance
– Your system administrator may need to install a compression library

set mapred.compress.map.output true;
set mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
... (other lines unchanged, but removed for brevity) ...
A Few More Tips for Improving Performance
§ Main theme: Eliminate unnecessary data as early as possible
– Use FOREACH ... GENERATE to select just those fields you need
– Use ORDER BY and LIMIT when you only need a few records
– Use DISTINCT when you don’t need duplicate records
§ Dropping records with NULL keys before a join can boost performance
– These records will be eliminated in the final output anyway
– But Pig doesn’t discard them until after the join
– Use FILTER to remove records with null keys before the join

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
nonnull_stores = FILTER stores BY store_id IS NOT NULL;
nonnull_sales = FILTER sales BY store_id IS NOT NULL;
joined = JOIN nonnull_stores BY store_id, nonnull_sales BY store_id;
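The first tip above, projecting only the fields you need with FOREACH ... GENERATE, might look like this against this chapter’s sales data (the extra notes column is hypothetical, added for illustration):

```pig
-- Suppose the raw file carries a column we never use
sales = LOAD 'sales' AS (store_id:chararray, price:int, notes:chararray);

-- Project away 'notes' immediately so later operators move less data
slim = FOREACH sales GENERATE store_id, price;

groups = GROUP slim BY store_id;
totals = FOREACH groups GENERATE group, SUM(slim.price) AS amount;
```

Dropping unused fields before a GROUP or JOIN shrinks the intermediate data that must be shuffled between the Map and Reduce phases.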
Essential Points
§ You can boost performance by eliminating unneeded data during processing
§ Pig’s error messages don’t always clearly identify the source of a problem
– We recommend testing your scripts with a small data sample
– Looking at the Web UI, and especially the log messages, can be helpful
§ The resources listed on the upcoming bibliography slides may further assist you in solving problems
Bibliography
The following offer more information on topics discussed in this chapter
§ Hadoop Log Location and Retention
– http://tiny.cloudera.com/dac08a
§ Pig Testing and Diagnostics
– http://tiny.cloudera.com/dac08b
§ Mailing List for Pig Users
– http://tiny.cloudera.com/dac08c
§ Questions Tagged with “Pig” on StackOverflow
– http://tiny.cloudera.com/dac08d
§ Questions Tagged with “PigLatin” on StackOverflow
– http://tiny.cloudera.com/dac08e
Bibliography
The following offer more information on topics discussed in this chapter
§ Performance Tips for Pig
– http://pig.apache.org/docs/r0.10.0/perf.html#performance-enhancers