Apache Hadoop – A course for undergraduates Lecture 10
© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Multi-Dataset Operations with Pig (Chapter 10.1)
Multi-Dataset Operations with Pig
§ How to use grouping to combine data from multiple sources
§ Types of join operations in Pig and how to use them
§ Concatenating records to produce a single data set
§ Splitting a single data set into multiple relations
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Overview of Combining Data Sets
§ So far, we have concentrated on processing single data sets
– Valuable insight often results from combining multiple data sets
§ Pig offers several techniques for achieving this
– Using the GROUP operator with multiple relations
– Joining the data as you would in SQL
– Performing set operations like CROSS and UNION
§ We will cover each of these in this chapter
Example Data Sets (1)
§ Most examples in this chapter will involve the same two data sets
§ The first is a file containing information about Dualcore's stores
Stores
A  Anchorage
B  Boston
C  Chicago
D  Dallas
E  Edmonton
F  Fargo

§ There are two fields in this relation
1. store_id:chararray (unique key)
2. name:chararray (name of the city in which the store is located)
Example Data Sets (2)
§ Our other data set is a file containing information about Dualcore's salespeople
§ This relation contains three fields
1. person_id:int (unique key)
2. name:chararray (salesperson name)
3. store_id:chararray (refers to store)

Salespeople
1   Alice     B
2   Bob       D
3   Carlos    F
4   Dieter    A
5   Étienne   F
6   Fredo     C
7   George    D
8   Hannah    B
9   Irina     C
10  Jack
Grouping Multiple Relations
§ We previously learned about the GROUP operator
– Groups values in a relation based on the specified field(s)
§ The GROUP operator can also group multiple relations
– In this case, using the synonymous COGROUP operator is preferred

grouped = COGROUP stores BY store_id, salespeople BY store_id;
§ This collects values from both data sets into a new relation
– As before, the new relation is keyed by a field named group
– This group field (here, the store_id) is associated with one bag for each input:

(group, {bag of records from stores}, {bag of records from salespeople})
Example of COGROUP
§ Grouping the stores and salespeople relations shown earlier:

grunt> grouped = COGROUP stores BY store_id, salespeople BY store_id;
grunt> DUMP grouped;
(A,{(A,Anchorage)},{(4,Dieter,A)})
(B,{(B,Boston)},{(1,Alice,B),(8,Hannah,B)})
(C,{(C,Chicago)},{(6,Fredo,C),(9,Irina,C)})
(D,{(D,Dallas)},{(2,Bob,D),(7,George,D)})
(E,{(E,Edmonton)},{})
(F,{(F,Fargo)},{(3,Carlos,F),(5,Étienne,F)})
(,{},{(10,Jack,)})
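The nested structure that COGROUP builds can be sketched in plain Python. This is an illustration of the semantics, not Pig's actual implementation; the helper name `cogroup` and the dict-of-bags layout are assumptions, and the sample data is a subset of the slides' relations:

```python
from collections import defaultdict

stores = [("A", "Anchorage"), ("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (8, "Hannah", "B"), (10, "Jack", None)]

def cogroup(left, right, left_key, right_key):
    """Collect records from both inputs into one bag per input, per key."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec[left_key]][0].append(rec)
    for rec in right:
        groups[rec[right_key]][1].append(rec)
    # Each entry mirrors (group, {bag from stores}, {bag from salespeople})
    return dict(groups)

grouped = cogroup(stores, salespeople, 0, 2)
# grouped["E"] holds Edmonton's record plus an empty bag, just as the
# (E,{(E,Edmonton)},{}) tuple above has no salespeople
```

Note how keys present in only one input (Edmonton, and Jack's null store_id) still get a group, with an empty bag for the missing side.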
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Join Overview
§ The COGROUP operator creates a nested data structure
§ Pig Latin's JOIN operator creates a flat data structure
– Similar to joins in a relational database
§ A JOIN is similar to doing a COGROUP followed by a FLATTEN
– Though they handle null values differently
Key Fields
§ Like COGROUP, joins rely on a field shared by each relation

joined = JOIN stores BY store_id, salespeople BY store_id;

§ Joins can also use multiple fields as the key

joined = JOIN customers BY (name, phone_number),
              accounts BY (name, phone_number);
Inner Joins
§ The default JOIN in Pig Latin is an inner join

joined = JOIN stores BY store_id, salespeople BY store_id;

§ An inner join outputs records only when the key is found in all inputs
– In the above example, stores that have at least one salesperson
§ You can do an inner join on multiple relations in a single statement
– But you must use the same key to join them
Inner Join Example
§ Joining the stores and salespeople relations:

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
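The flat output of an inner join can be modeled as one concatenated record per matching pair. This Python sketch (the `inner_join` helper is hypothetical, and the data is a subset of the slide relations) shows why Edmonton and Jack disappear:

```python
stores = [("A", "Anchorage"), ("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (4, "Dieter", "A"), (10, "Jack", None)]

def inner_join(left, right, left_key, right_key):
    """One flat record per matching pair of keys; unmatched records
    (Edmonton here, and Jack with his null store_id) are dropped."""
    return [l + r
            for l in left
            for r in right
            if l[left_key] is not None and l[left_key] == r[right_key]]

joined = inner_join(stores, salespeople, 0, 2)
# → [("A", "Anchorage", 4, "Dieter", "A"), ("B", "Boston", 1, "Alice", "B")]
```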
Eliminating Duplicate Fields (1)
§ As with COGROUP, the new relation still contains duplicate fields

grunt> joined = JOIN stores BY store_id, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
Eliminating Duplicate Fields (2)
§ We can use FOREACH...GENERATE to retain just the fields we need
– However, it is now slightly more complex to reference fields
– We must fully qualify any fields with names that are not unique

grunt> DESCRIBE joined;
joined: {stores::store_id: chararray,stores::name: chararray,
salespeople::person_id: int,salespeople::name: chararray,
salespeople::store_id: chararray}
grunt> cleaned = FOREACH joined GENERATE stores::store_id,
stores::name, person_id, salespeople::name;
grunt> DUMP cleaned;
(A,Anchorage,4,Dieter)
(B,Boston,1,Alice)
(B,Boston,8,Hannah)
... (additional records omitted for brevity) ...
Outer Joins
§ Pig Latin allows you to specify the type of join following the field name
– Inner joins do not specify a join type

joined = JOIN relation1 BY field [LEFT|RIGHT|FULL] OUTER,
              relation2 BY field;

§ An outer join does not require the key to be found in both inputs
§ Outer joins require Pig to know the schema for at least one relation
– Which relation requires a schema depends on the join type
– Full outer joins require schemas for both relations
Left Outer Join Example
§ Result contains all records from the relation specified on the left, but only matching records from the one specified on the right

grunt> joined = JOIN stores BY store_id LEFT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
Right Outer Join Example
§ Result contains all records from the relation specified on the right, but only matching records from the one specified on the left

grunt> joined = JOIN stores BY store_id RIGHT OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)
Full Outer Join Example
§ Result contains all records from both relations, whether or not the key has a match in the other relation

grunt> joined = JOIN stores BY store_id FULL OUTER, salespeople BY store_id;
grunt> DUMP joined;
(A,Anchorage,4,Dieter,A)
(B,Boston,1,Alice,B)
(B,Boston,8,Hannah,B)
(C,Chicago,6,Fredo,C)
(C,Chicago,9,Irina,C)
(D,Dallas,2,Bob,D)
(D,Dallas,7,George,D)
(E,Edmonton,,,)
(F,Fargo,3,Carlos,F)
(F,Fargo,5,Étienne,F)
(,,10,Jack,)
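The null padding that distinguishes the outer joins can be sketched the same way. The `full_outer_join` helper below is hypothetical and simplified: field counts are hard-coded for the two sample relations, and since the left keys here are all non-null it skips the rule that Pig never matches null keys against each other:

```python
stores = [("B", "Boston"), ("E", "Edmonton")]
salespeople = [(1, "Alice", "B"), (10, "Jack", None)]

def full_outer_join(left, right, left_key, right_key):
    """Match on key where possible; pad the missing side with nulls."""
    results = []
    matched_right = set()
    for l in left:
        hits = [r for r in right if r[right_key] == l[left_key]]
        matched_right.update(id(r) for r in hits)
        if hits:
            results.extend(l + r for r in hits)
        else:
            results.append(l + (None,) * 3)   # right side has 3 fields
    for r in right:
        if id(r) not in matched_right:
            results.append((None,) * 2 + r)   # left side has 2 fields
    return results

joined = full_outer_join(stores, salespeople, 0, 2)
# Boston matches Alice; Edmonton and Jack survive with null padding,
# like the (E,Edmonton,,,) and (,,10,Jack,) tuples above
```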
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Crossing Data Sets
§ JOIN finds records in one relation that match records in another
§ Pig's CROSS operator creates the cross product of both relations
– Combines all records in both relations regardless of matching
– In other words, all possible combinations of records

crossed = CROSS stores, salespeople;

§ Careful: this can generate huge amounts of data!
Cross Product Example
§ Generates every possible combination of records in the stores and salespeople relations
§ This example uses smaller inputs: three stores (A Anchorage, B Boston, D Dallas) and four salespeople (1 Alice B, 2 Bob D, 8 Hannah B, 10 Jack)

grunt> crossed = CROSS stores, salespeople;
grunt> DUMP crossed;
(A,Anchorage,1,Alice,B)
(A,Anchorage,2,Bob,D)
(A,Anchorage,8,Hannah,B)
(A,Anchorage,10,Jack,)
(B,Boston,1,Alice,B)
(B,Boston,2,Bob,D)
(B,Boston,8,Hannah,B)
(B,Boston,10,Jack,)
(D,Dallas,1,Alice,B)
(D,Dallas,2,Bob,D)
(D,Dallas,8,Hannah,B)
(D,Dallas,10,Jack,)
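The cross-product semantics map directly onto `itertools.product`, which also makes the size warning concrete: the output grows as the product of the input sizes. A small sketch using a subset of the slide data:

```python
from itertools import product

stores = [("A", "Anchorage"), ("B", "Boston"), ("D", "Dallas")]
salespeople = [(1, "Alice", "B"), (10, "Jack", None)]

# Every pairing of a store record with a salesperson record,
# whether or not the store_ids match
crossed = [s + p for s, p in product(stores, salespeople)]

assert len(crossed) == len(stores) * len(salespeople)  # 3 * 2 = 6
```

With two relations of a million records each, the same operation would emit a trillion records, which is why CROSS deserves the "careful" warning above.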
Concatenating Data Sets
§ We have explored several techniques for combining data sets
– They have had one thing in common: they combine horizontally
§ The UNION operator combines records vertically
– It adds data from its input relations into a single new relation
– Pig does not require these inputs to have the same schema
– It does not eliminate duplicate records, nor does it preserve order
§ This is helpful for incorporating new data into your processing

both = UNION june_items, july_items;
UNION Example
§ Concatenates all records from the june_items and july_items relations

june_items:        july_items:
(Adapter,549)      (Fax,17999)
(Battery,349)      (GPS,24999)
(Cable,799)        (HDTV,65999)
(DVD,1999)         (Ink,3999)
(HDTV,79999)

grunt> both = UNION june_items, july_items;
grunt> DUMP both;
(Fax,17999)
(GPS,24999)
(HDTV,65999)
(Ink,3999)
(Adapter,549)
(Battery,349)
(Cable,799)
(DVD,1999)
(HDTV,79999)
Chapter Topics: Multi-Dataset Operations with Pig
§ Techniques for Combining Data Sets
§ Joining Data Sets in Pig
§ Set Operations
§ Splitting Data Sets
Splitting Data Sets
§ You have learned several ways to combine data sets into a single relation
§ Sometimes you need to split a data set into multiple relations
– Server logs by date range
– Customer lists by region
– Product lists by vendor
§ Pig Latin supports this with the SPLIT operator

SPLIT relation INTO relationA IF expression1,
                    relationB IF expression2,
                    relationC IF expression3...;

– Expressions need not be mutually exclusive
SPLIT Example
§ Split customers into groups for a rewards program, based on lifetime value (ltv)

customers:
(Annette,9700) (Bruce,23500) (Charles,17800) (Dustin,21250) (Eva,8500)
(Felix,9300) (Glynn,27800) (Henry,8900) (Ian,43800) (Jeff,29100)
(Kai,34000) (Laura,7800) (Mirko,24200)

grunt> SPLIT customers INTO gold_program IF ltv >= 25000,
                            silver_program IF ltv >= 10000 AND ltv < 25000;
grunt> DUMP gold_program;
(Glynn,27800)
(Ian,43800)
(Jeff,29100)
(Kai,34000)
grunt> DUMP silver_program;
(Bruce,23500)
(Charles,17800)
(Dustin,21250)
(Mirko,24200)
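SPLIT behaves like evaluating every record against each condition independently, which a pair of Python filters over a subset of the slide's data makes explicit:

```python
# A subset of the customers relation shown above
customers = [("Annette", 9700), ("Bruce", 23500), ("Glynn", 27800),
             ("Ian", 43800), ("Laura", 7800)]

# Each record is tested against every condition; because the conditions
# need not be mutually exclusive, a record may land in several output
# relations, or in none (Annette and Laura match neither condition here)
gold_program = [c for c in customers if c[1] >= 25000]
silver_program = [c for c in customers if 10000 <= c[1] < 25000]
```

Records that satisfy no condition are simply discarded, which is why customers with a lifetime value under 10000 appear in neither DUMP above.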
Essential Points
§ You can use COGROUP to group multiple relations
– This creates a nested data structure
§ Pig supports common SQL join types
– Inner, left outer, right outer, and full outer
– You may need to fully qualify field names when using joined data
§ Pig's CROSS operator creates every possible combination of input data
– This can create huge amounts of data, so use it carefully!
§ You can use a UNION to concatenate data sets
§ In addition to combining data sets, Pig supports splitting them too
Bibliography
The following offer more information on topics discussed in this chapter
§ For Pig GROUP and COGROUP information:
– http://pig.apache.org/docs/r0.10.0/basic.html#GROUP
Extending Pig (Chapter 10.2)
Extending Pig
§ How to use parameters in your Pig Latin to increase its flexibility
§ How to define and invoke macros to improve the reusability of your code
§ How to call user-defined functions from your code
§ How to write user-defined functions in Python
§ How to process data with external scripts
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
The Need for Parameters (1)
§ Some processing is very repetitive
– For example, creating sales reports

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
bigsales_alice = FILTER bigsales BY name == 'Alice';
STORE bigsales_alice INTO 'Alice';
The Need for Parameters (2)
§ You may need to change the script slightly for each run
– For example, to modify the paths or filter criteria

allsales = LOAD 'sales' AS (name, price);
bigsales = FILTER allsales BY price > 999;
bigsales_alice = FILTER bigsales BY name == 'Alice';
STORE bigsales_alice INTO 'Alice';
Making the Script More Flexible with Parameters
§ Instead of hardcoding values, Pig allows you to use parameters
– These are replaced with specified values at runtime

allsales = LOAD '$INPUT' AS (name, price);
bigsales = FILTER allsales BY price > $MINPRICE;
bigsales_name = FILTER bigsales BY name == '$NAME';
STORE bigsales_name INTO '$NAME';

– Then specify the values on the command line

$ pig -p INPUT=sales -p MINPRICE=999 \
      -p NAME='Jo Anne' reporter.pig
Two Tricks for Specifying Parameter Values
§ You can also specify parameter values in a text file
– An alternative to typing each one on the command line

INPUT=sales
MINPRICE=999
# comments look like this
NAME='Alice'

– Use the -m filename option to tell Pig which file contains the values
§ Parameter values can be defined with the output of a shell command
– For example, to set MONTH to the current month:

MONTH=`date +'%m'`  # returns 03 for March, 05 for May
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
The Need for Macros
§ Parameters simplify repetitive code by allowing you to pass in values
– But sometimes you would like to reuse the actual code too

allsales = LOAD 'sales' AS (name, price);
byperson = FILTER allsales BY name == 'Alice';
SPLIT byperson INTO low IF price < 1000,
                    high IF price >= 1000;
amt1 = FOREACH low GENERATE name, price * 0.07 AS amount;
amt2 = FOREACH high GENERATE name, price * 0.12 AS amount;
commissions = UNION amt1, amt2;
grpd = GROUP commissions BY name;
out = FOREACH grpd GENERATE SUM(commissions.amount) AS total;
Defining a Macro in Pig Latin
§ Macros allow you to define a block of code to reuse easily
– Similar (but not identical) to a function in a programming language

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT)
returns result {
    allsales = LOAD 'sales' AS (name, price);
    byperson = FILTER allsales BY name == '$NAME';
    SPLIT byperson INTO low IF price < $SPLIT_AMT,
                        high IF price >= $SPLIT_AMT;
    amt1 = FOREACH low GENERATE name, price * $LOW_PCT AS amount;
    amt2 = FOREACH high GENERATE name, price * $HIGH_PCT AS amount;
    commissions = UNION amt1, amt2;
    grouped = GROUP commissions BY name;
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};
Invoking Macros
§ To invoke a macro, call it by name and supply values in the correct order

define calc_commission (NAME, SPLIT_AMT, LOW_PCT, HIGH_PCT) returns result {
    allsales = LOAD 'sales' AS (name, price);
    ... (other code removed for brevity) ...
    $result = FOREACH grouped GENERATE SUM(commissions.amount);
};

alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
carlos_comm = calc_commission('Carlos', 2000, 0.08, 0.14);
Reusing Code with Imports
§ After defining a macro, you may wish to use it in multiple scripts
§ You can include one script within another, starting with Pig 0.9
– This is done with the import keyword and the path to the file being imported

-- We saved the macro to a file named commission_calc.pig
import 'commission_calc.pig';
alice_comm = calc_commission('Alice', 1000, 0.07, 0.12);
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
User-Defined Functions (UDFs)
§ We have covered many of Pig's built-in functions already
§ It is also possible to define your own functions
– Pig allows writing UDFs in several languages

Language                    Supported in Pig Versions
Java                        All
Python                      0.8 and later
JavaScript (experimental)   0.9 and later
Ruby (experimental)         0.10 and later
Groovy (experimental)       0.11 and later
§ In the next few slides, you will see how to use UDFs in Java, and how to write and use UDFs in Python
Using UDFs Written in Java
§ UDFs are packaged into Java Archive (JAR) files
§ There are only two required steps for using them
– Register the JAR file(s) containing the UDF and its dependencies
– Invoke the UDF using the fully-qualified classname

REGISTER '/path/to/myudf.jar';
...
data = FOREACH allsales GENERATE com.example.MYFUNC(name);

§ You can optionally define an alias for the function

REGISTER '/path/to/myudf.jar';
DEFINE FOO com.example.MYFUNC;
...
data = FOREACH allsales GENERATE FOO(name);
Writing UDFs in Python (1)
§ Now we will see how to write a UDF in Python
§ The data we want to process has inconsistent phone number formats

Alice   (314) 555-1212
Bob     212.555.9753
Carlos  405-555-3912
David   (202) 555.8471

§ We will write a Python UDF that can consistently extract the area code
Writing UDFs in Python (2)
§ Our Python code is straightforward
§ The only unusual thing is the optional @outputSchema decorator
– This tells Pig what data type we are returning
– If not specified, Pig will assume bytearray

@outputSchema("areacode:chararray")
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats
    if len(phone) == 12:
        # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:
        # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode
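Because the function body is ordinary Python, it can be checked outside Pig before registering it. This sketch drops the @outputSchema decorator, which exists only inside Pig's Jython environment, and exercises the four sample formats:

```python
def get_area_code(phone):
    areacode = "???"  # return this for unknown formats
    if len(phone) == 12:    # XXX-YYY-ZZZZ or XXX.YYY.ZZZZ format
        areacode = phone[0:3]
    elif len(phone) == 14:  # (XXX) YYY-ZZZZ or (XXX) YYY.ZZZZ format
        areacode = phone[1:4]
    return areacode

# The four inconsistent formats from the previous slide
assert get_area_code("(314) 555-1212") == "314"
assert get_area_code("212.555.9753") == "212"
assert get_area_code("405-555-3912") == "405"
assert get_area_code("(202) 555.8471") == "202"
assert get_area_code("555-1212") == "???"  # unknown format falls through
```

Testing the logic this way catches format-handling mistakes much faster than rerunning the full Pig job.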
Invoking Python UDFs from Pig Latin
§ Using this UDF from our Pig Latin is also easy
– We saved our Python code as phonenumber.py
– This Python file is in our current directory

REGISTER 'phonenumber.py' USING jython AS phoneudf;

names = LOAD 'names' AS (name:chararray, phone:chararray);
areacodes = FOREACH names GENERATE phoneudf.get_area_code(phone) AS ac;
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
Open Source UDFs
§ Pig ships with a set of community-contributed UDFs called Piggy Bank
§ Another popular package of UDFs, called DataFu, has been open-sourced by LinkedIn
Piggy Bank
§ Piggy Bank ships with Pig
– You will need to register the piggybank.jar file
– The location may vary depending on source and version
– In CDH on our VMs, it is at /usr/lib/pig/piggybank.jar
§ Some UDFs in Piggy Bank include (package names omitted for brevity)

Class Name      Description
ISOToUnix       Converts an ISO 8601 date/time format to UNIX format
UnixToISO       Converts a UNIX date/time format to ISO 8601 format
LENGTH          Returns the number of characters in the supplied string
HostExtractor   Returns the host name from a URL
DiffDate        Returns number of days between two dates
DataFu
§ DataFu does not ship with Pig, but is part of CDH 4.1.0 and later
– You will need to register the DataFu JAR file
– In the VM, it is at /usr/lib/pig/datafu-0.0.4-cdh4.2.0.jar
§ Some UDFs in DataFu include (package names omitted for brevity)

Class Name             Description
Quantile               Calculates quantiles for a data set
Median                 Calculates the median for a data set
Sessionize             Groups data into sessions based on a specified time window
HaversineDistInMiles   Calculates distance in miles between two points, given latitude and longitude
Using a Contributed UDF
§ Here is an example of using a UDF from DataFu to calculate distance

Input data:
37.789336  -122.401385  40.707555  -74.011679

Pig Latin:
REGISTER '/usr/lib/pig/datafu-*.jar';
DEFINE DIST datafu.pig.geo.HaversineDistInMiles;
places = LOAD 'data' AS (lat1:double, lon1:double, lat2:double, lon2:double);
dist = FOREACH places GENERATE DIST(lat1, lon1, lat2, lon2);
DUMP dist;

Output data:
(2564.207116295711)
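The haversine formula behind that UDF is short enough to sketch directly. This is an illustration, not DataFu's source: the Earth radius of 3958.76 miles is an assumption, and DataFu's exact constant may differ slightly, so the result is only close to the value Pig printed:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    EARTH_RADIUS_MILES = 3958.76  # assumed mean radius
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)   # difference in latitude
    dlam = radians(lon2 - lon1)   # difference in longitude
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# The input coordinates from the example above
d = haversine_miles(37.789336, -122.401385, 40.707555, -74.011679)
assert 2500 < d < 2600  # in the neighborhood of the 2564.2 miles shown
```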
Chapter Topics: Extending Pig
§ Adding Flexibility with Parameters
§ Macros and Imports
§ UDFs
§ Contributed Functions
§ Using Other Languages to Process Data with Pig
Processing Data with an External Script
§ While Pig Latin is powerful, some tasks are easier in another language
§ Pig allows you to stream data through another language for processing
– This is done using the STREAM keyword
§ Similar concept to Hadoop Streaming
– Data is supplied to the script on standard input as tab-delimited fields
– The script writes results to standard output as tab-delimited fields
STREAM Example in Python (1)
§ Our example will calculate a user's age given that user's birthdate
– This calculation is done in a Python script named agecalc.py
§ Here is the corresponding Pig Latin code
– Backticks are used to quote the script name following the alias
– Single quotes are used to quote the script name within SHIP
– The schema for the data produced by the script follows the AS keyword

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;
STREAM Example in Python (2)
§ Python code for agecalc.py

#!/usr/bin/env python
import sys
from datetime import datetime

for line in sys.stdin:
    line = line.strip()
    (name, birthdate) = line.split("\t")
    d1 = datetime.strptime(birthdate, '%Y-%m-%d')
    d2 = datetime.now()
    age = int((d2 - d1).days / 365)
    print "%s\t%i" % (name, age)
STREAM Example in Python (3)
§ The Pig script again, and the data it reads and writes

DEFINE MYSCRIPT `agecalc.py` SHIP('agecalc.py');
users = LOAD 'data' AS (name:chararray, birthdate:chararray);
out = STREAM users THROUGH MYSCRIPT AS (name:chararray, age:int);
DUMP out;

Input data:           Output data:
andy    1963-11-15    (andy,49)
betty   1985-12-30    (betty,27)
chuck   1979-02-23    (chuck,34)
debbie  1982-09-19    (debbie,30)
Essential Points
§ Pig supports several extension mechanisms
§ Parameters and macros can help make your code more reusable
– And easier to maintain and share with others
§ Piggy Bank and DataFu are two examples of open source UDFs
– You can also write your own UDFs
§ It is also possible to embed Pig within another language
Bibliography
The following offer more information on topics discussed in this chapter
§ Documentation on Parameter Substitution in Pig
– http://tiny.cloudera.com/dac07a
§ Documentation on Macros in Pig
– http://tiny.cloudera.com/dac07b
§ Documentation on User-Defined Functions in Pig
– http://tiny.cloudera.com/dac07c
§ Documentation on Piggy Bank
– http://tiny.cloudera.com/dac07d
§ Introducing DataFu
– http://tiny.cloudera.com/dac07e
Bibliography (cont'd)
The following offer more information on topics discussed in this chapter
§ Details about the other supported languages can be found in the latest documentation on the Pig web site:
– http://pig.apache.org/docs/r0.11.1/udf.html
§ For details on Python UDFs, see
– http://pig.apache.org/docs/r0.10.0/udf.html#python-udfs
Pig Troubleshooting and Optimization (Chapter 10.3)
Pig Troubleshooting and Optimization
§ How to control the information that Pig and Hadoop write to log files
§ How Hadoop's Web UI can help you troubleshoot failed jobs
§ How to use SAMPLE and ILLUSTRATE to test and debug Pig jobs
§ How Pig creates MapReduce jobs from your Pig Latin code
§ How several simple changes to your Pig Latin code can make it run faster
§ Which resources are especially helpful for troubleshooting Pig errors
Chapter Topics: Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop's Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Troubleshooting Overview
§ We have now covered how to use Pig for data analysis
– Unfortunately, sometimes your code may not work as you expect
– It is important to remember that Pig and Hadoop are intertwined
§ Here we will cover some techniques for isolating and resolving problems
– We will start with a few options to the pig command
Helping Yourself
§ We will discuss some options for the pig command in this chapter
– You can view all of them by using the -h (help) option
– Keep in mind that many options are advanced or rarely used
§ One useful option is -c (check), which validates the syntax of your code

$ pig -c myscript.pig
myscript.pig syntax OK

§ The -dryrun option is very helpful if you use parameters or macros

$ pig -p INPUT=demodata -dryrun myscript.pig

– Creates a myscript.pig.substituted file in the current directory
Getting Help from Others
§ Sometimes you may need help from others
– Mailing lists or newsgroups
– Forums and bulletin board sites
– Support services
§ You will probably need to provide the versions of Pig and Hadoop you are using

$ pig -version
Apache Pig version 0.10.0-cdh4.2.0
$ hadoop version
Hadoop 2.0.0-cdh4.2.0
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Customizing Log Messages
§ You may wish to change how much information is logged
– A recent change in Hadoop can cause lots of warnings when using Pig
§ Pig and Hadoop use the Log4J library, which is easily customized
§ Edit the /etc/pig/conf/log4j.properties file to include:

log4j.logger.org.apache.pig=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration=ERROR
§ Edit the /etc/pig/conf/pig.properties file to set this property:

log4jconf=/etc/pig/conf/log4j.properties
Customizing Log Messages on a Per-Job Basis
§ Often you just want to temporarily change the log level
– Especially while trying to troubleshoot a problem with your script
§ You can specify a Log4J properties file to use when you invoke Pig
– This overrides the default Log4J configuration
§ Create a customlog.properties file to include:

log4j.logger.org.apache.pig=DEBUG
§ Specify this file via the -log4jconf argument to Pig

$ pig -log4jconf customlog.properties
Controlling Client-Side Log Files
§ When a job fails, Pig may produce a log file to explain why
– These are typically produced in your current directory
§ To use a different location, use the -l (log) option when starting Pig

$ pig -l /tmp
§ Or set it permanently by editing /etc/pig/conf/pig.properties
– Specify a different directory using the log.file property

log.file=/tmp
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
The Hadoop Web UI
§ Each Hadoop daemon has a corresponding Web application
– This allows us to easily see cluster and job status with a browser
– In pseudo-distributed mode, the hostname is localhost
      Daemon Name       Address
HDFS  NameNode          http://hostname:50070/
      DataNode          http://hostname:50075/
MR1   JobTracker        http://hostname:50030/
      TaskTracker       http://hostname:50060/
MR2   ResourceManager   http://hostname:8088/
      NodeManager       http://hostname:8042/
The JobTracker Web UI (1)
§ The JobTracker offers the most useful of Hadoop’s Web UIs
– It displays MapReduce status information for the Hadoop cluster
The JobTracker Web UI (2)
§ The JobTracker Web UI also shows historical information
– You can click one of the links to see details for a particular job
The JobTracker Web UI (3) § The job detail page can help you troubleshoot a problem
Naming Your Job
§ Hadoop clusters are typically shared resources
– There might be dozens or hundreds of others using it
– As a result, sometimes it is hard to find your job in the Web UI
§ We recommend setting a name in your scripts to help identify your jobs
– Set the job.name property, either in Grunt or your script

grunt> set job.name 'Q2 2013 Sales Reporter';
Killing a Job
§ A job that processes a lot of data can take hours to complete
– Sometimes you spot an error in your code just after submitting a job
– Rather than wait for the job to complete, you can kill it
§ First, find the Job ID on the front page of the JobTracker Web UI
§ Then, use the kill command in Pig along with that Job ID

grunt> kill job_201303151454_0028
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Using SAMPLE to Create a Smaller Data Set
§ Your code might process terabytes of data in production
– However, it is convenient to test with smaller amounts during development
§ Use SAMPLE to choose a random set of records from a data set
§ This example selects about 5% of records from bigdata
– Stores them in a new directory called mysample

everything = LOAD 'bigdata';
subset = SAMPLE everything 0.05;
STORE subset INTO 'mysample';
Intelligent Sampling with ILLUSTRATE
§ Sometimes a random sample may lack data needed for testing
– For example, matching records in two data sets for a JOIN operation
§ Pig’s ILLUSTRATE keyword can do more intelligent sampling
– Pig will examine the code to determine what data is needed
– It picks a few records that properly exercise the code
§ You should specify a schema when using ILLUSTRATE
– Pig will generate records when yours don’t suffice
Using ILLUSTRATE Helps You to Understand Data Flow
§ Like DUMP and DESCRIBE, ILLUSTRATE aids in debugging
– The syntax is the same for all three

grunt> allsales = LOAD 'sales' AS (name:chararray, price:int);
grunt> bigsales = FILTER allsales BY price > 999;
grunt> ILLUSTRATE bigsales;
(Bob,3625)
---------------------------------------------
| allsales | name:chararray | price:int    |
---------------------------------------------
|          | Bob            | 3625         |
|          | Bob            | 998          |
---------------------------------------------
---------------------------------------------
| bigsales | name:chararray | price:int    |
---------------------------------------------
|          | Bob            | 3625         |
---------------------------------------------
General Debugging Strategies
§ Use DUMP, DESCRIBE, and ILLUSTRATE often
– The data might not be what you think it is
§ Look at a sample of the data
– Verify that it matches the fields in your LOAD specification
§ Other helpful steps for tracking down a problem
– Use -dryrun to see the script after parameters and macros are processed
– Test external scripts (STREAM) by passing some data from a local file
– Look at the logs, especially task logs available from the Web UI
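One way to combine these strategies (a sketch; the paths, file names, and sample rate are illustrative, not from the course): carve off a small sample of the production data, copy it locally, and iterate in Pig’s local mode, which runs against the local filesystem without submitting MapReduce jobs.

```pig
-- Grab roughly 1% of the production data for testing
raw    = LOAD 'bigdata';
subset = SAMPLE raw 0.01;
STORE subset INTO 'testdata';

-- Then iterate quickly against a local copy of the sample:
--   $ hadoop fs -get testdata .
--   $ pig -x local myscript.pig
```

Because local mode skips job submission entirely, each debugging cycle takes seconds rather than minutes.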
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Performance Overview
§ We have discussed several techniques for finding errors in Pig Latin code
– Once you get your code working, you’ll often want it to work faster
§ Performance tuning is a broad and complex subject
– It requires a deep understanding of Pig, Hadoop, Java, and Linux
– It is typically the domain of engineers and system administrators
§ Most of these topics are beyond the scope of this course
– We’ll cover the basics and offer several performance improvement tips
– See Programming Pig (chapters 7 and 8) for detailed coverage
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
How Pig Latin Becomes a MapReduce Job
§ Pig Latin code ultimately runs as MapReduce jobs on the Hadoop cluster
§ However, Pig does not translate your code into Java MapReduce
– Much like relational databases don’t translate SQL into C code
– Like a database, Pig interprets the Pig Latin to develop execution plans
– Pig’s execution engine uses these to submit MapReduce jobs to Hadoop
§ The EXPLAIN keyword details Pig’s three execution plans
– Logical
– Physical
– MapReduce
§ Seeing an example job will help us better understand EXPLAIN’s output
Description of Our Example Code and Data
§ Our goal is to produce a list of per-store sales

stores            sales
A  Anchorage      A  1999
B  Boston         D  2399
C  Chicago        A  4579
D  Dallas         B  6139
E  Edmonton       A  2489
F  Fargo          B  3699
                  E  2479
                  D  5799

grunt> stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
grunt> sales = LOAD 'sales' AS (store_id:chararray, price:int);
grunt> groups = GROUP sales BY store_id;
grunt> totals = FOREACH groups GENERATE group, SUM(sales.price) AS amount;
grunt> joined = JOIN totals BY group, stores BY store_id;
grunt> result = FOREACH joined GENERATE name, amount;
grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)
Using the EXPLAIN Keyword
§ Using EXPLAIN rather than DUMP will show the execution plans

grunt> DUMP result;
(Anchorage,9067)
(Boston,9838)
(Dallas,8198)
(Edmonton,2479)

grunt> EXPLAIN result;
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
result: (Name: LOStore Schema: stores::name#49:chararray,totals::amount#70:long)
|
|---result: (Name: LOForEach Schema: stores::name#49:chararray,totals::amount#70:long)

(other lines, including the physical and MapReduce plans, would follow)
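When the text plans grow long, EXPLAIN also accepts options to write its output elsewhere or in another format (a sketch; the output paths here are illustrative):

```pig
-- Write the plans to files under a directory instead of printing to the screen
grunt> EXPLAIN -out /tmp/plans result;

-- Emit the plans in DOT format, which Graphviz can render as a diagram
grunt> EXPLAIN -dot -out /tmp/dotplans result;
```

A rendered diagram can make it much easier to see how Pig grouped your operators into MapReduce jobs.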
Chapter Topics
Pig Troubleshooting and Optimization
§ Troubleshooting Pig
§ Logging
§ Using Hadoop’s Web UI
§ Data Sampling and Debugging
§ Performance Overview
§ Understanding the Execution Plan
§ Tips for Improving the Performance of Your Pig Jobs
Pig’s Runtime Optimizations
§ Pig does not necessarily run your statements exactly as you wrote them
§ It may remove operations for efficiency

sales = LOAD 'sales' AS (store_id:chararray, price:int);
unused = FILTER sales BY price > 789;
DUMP sales;
§ It may also rearrange operations for efficiency

grouped = GROUP sales BY store_id;
totals = FOREACH grouped GENERATE group, SUM(sales.price);
joined = JOIN totals BY group, stores BY store_id;
only_a = FILTER joined BY store_id == 'A';
DUMP only_a;
Optimizations You Can Make in Your Pig Latin Code
§ Pig’s optimizer does what it can to improve performance
– But you know your own code and data better than it does
– A few small changes in your code can allow additional optimizations
§ On the next few slides, we will rewrite this Pig code for performance

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Don’t Produce Output You Don’t Really Need
§ In this case, we forgot to remove the DUMP statement
– This sometimes happens when moving from development to production
– And it might go unnoticed if you’re not watching the terminal

stores = LOAD 'stores' AS (store_id, name, postcode, phone);
sales = LOAD 'sales' AS (store_id, price);
joined = JOIN sales BY store_id, stores BY store_id;
DUMP joined;    -- remove this
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Specify Schema Whenever Possible
§ Specifying schema when loading data eliminates the need for Pig to guess
– It may choose a ‘bigger’ type than you need (e.g., long instead of int)
§ The postcode and phone fields in the stores data set were also never used
– Eliminating them in our schema ensures they’ll be omitted in joined

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
joined = JOIN sales BY store_id, stores BY store_id;
groups = GROUP joined BY sales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.sales::price) AS amount;
unique = DISTINCT totals;
region = FILTER unique BY name == 'Anchorage' OR name == 'Edmonton';
sorted = ORDER region BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Filter Unwanted Data As Early As Possible
§ We previously did our JOIN before our FILTER
– This produced lots of data we ultimately discarded
– Moving the FILTER operation up makes our script more efficient
– Caveat: We now have to filter by store ID rather than store name

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Consider Adjusting the Parallelization
§ Hadoop clusters scale by processing data in parallel
– Newer Pig releases choose the number of reducers based on input size
– However, it is often beneficial to set a value explicitly in your script
– Your system administrator can help you determine the best value

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN regsales BY store_id, stores BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Specify the Smaller Data Set First in a Join
§ We can optimize joins by specifying the larger data set last
– Pig will ‘stream’ the larger data set instead of reading it into memory
– In our case, we have far more records in sales than in stores
– Changing the order in the JOIN statement can boost performance

set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
unique = DISTINCT totals;
sorted = ORDER unique BY amount DESC;
topone = LIMIT sorted 1;
STORE topone INTO 'topstore';
Try Using Compression on Intermediate Data
§ Pig scripts often yield jobs with both a Map and a Reduce phase
– Remember that Mapper output becomes Reducer input
– Compressing this intermediate data is easy and can boost performance
– Your system administrator may need to install a compression library

set mapred.compress.map.output true;
set mapred.map.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
set default_parallel 5;
stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
regsales = FILTER sales BY store_id == 'A' OR store_id == 'E';
joined = JOIN stores BY store_id, regsales BY store_id;
groups = GROUP joined BY regsales::store_id;
totals = FOREACH groups GENERATE FLATTEN(joined.stores::name) AS name,
    SUM(joined.regsales::price) AS amount;
... (other lines unchanged, but removed for brevity) ...
A Few More Tips for Improving Performance
§ Main theme: Eliminate unnecessary data as early as possible
– Use FOREACH ... GENERATE to select just those fields you need
– Use ORDER BY and LIMIT when you only need a few records
– Use DISTINCT when you don’t need duplicate records
§ Dropping records with NULL keys before a join can boost performance
– These records will be eliminated in the final output anyway
– But Pig doesn’t discard them until after the join
– Use FILTER to remove records with null keys before the join

stores = LOAD 'stores' AS (store_id:chararray, name:chararray);
sales = LOAD 'sales' AS (store_id:chararray, price:int);
nonnull_stores = FILTER stores BY store_id IS NOT NULL;
nonnull_sales = FILTER sales BY store_id IS NOT NULL;
joined = JOIN nonnull_stores BY store_id, nonnull_sales BY store_id;
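The first tip above, projecting only the fields you need with FOREACH ... GENERATE, might look like this against this chapter’s sales data (the extra notes column is hypothetical, added for illustration):

```pig
-- Suppose the raw file carries a column we never use
sales = LOAD 'sales' AS (store_id:chararray, price:int, notes:chararray);

-- Project away 'notes' immediately so later operators move less data
slim = FOREACH sales GENERATE store_id, price;

groups = GROUP slim BY store_id;
totals = FOREACH groups GENERATE group, SUM(slim.price) AS amount;
```

Dropping unused fields before a GROUP or JOIN shrinks the intermediate data that must be shuffled between the Map and Reduce phases.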
Essential Points
§ You can boost performance by eliminating unneeded data during processing
§ Pig’s error messages don’t always clearly identify the source of a problem
– We recommend testing your scripts with a small data sample
– Looking at the Web UI, and especially the log messages, can be helpful
§ The resources listed on the upcoming bibliography slides may further assist you in solving problems
Bibliography
The following offer more information on topics discussed in this chapter
§ Hadoop Log Location and Retention
– http://tiny.cloudera.com/dac08a
§ Pig Testing and Diagnostics
– http://tiny.cloudera.com/dac08b
§ Mailing List for Pig Users
– http://tiny.cloudera.com/dac08c
§ Questions Tagged with “Pig” on StackOverflow
– http://tiny.cloudera.com/dac08d
§ Questions Tagged with “PigLatin” on StackOverflow
– http://tiny.cloudera.com/dac08e
Bibliography
The following offer more information on topics discussed in this chapter
§ Performance Tips for Pig
– http://pig.apache.org/docs/r0.10.0/perf.html#performance-enhancers