An Introduction to Hadoop
Mark Fei, Cloudera
Strata + Hadoop World 2012, New York City, October 23, 2012
Who Am I?
Mark Fei, Cloudera (Durango, Colorado)
Current:
• Senior Instructor at Cloudera
Past:
• Professional Services – Education, VMware
• Senior Member of Technical Staff, Hill Associates
• Sales Engineer, Nortel Networks
• Systems Programmer, large bank
• Banking applications software developer
What's Ahead?
• A solid introduction to Apache Hadoop
  • What it is
  • Why it's relevant
  • How it works
  • The ecosystem
• No prior experience needed
• Feel free to ask questions
What is Apache Hadoop?
• Scalable data storage and processing
  • Open source Apache project
  • Harnesses the power of commodity servers
  • Distributed and fault-tolerant
• "Core" Hadoop consists of two main parts
  • HDFS (storage)
  • MapReduce (processing)
• A large ecosystem
Who Uses Hadoop?
• Vendors across the stack integrate with Hadoop:
  • BI / Analytics
  • ETL
  • Database
  • OS / Cloud / System Mgmt.
  • Hardware
About Cloudera
• Cloudera is "The commercial Hadoop company"
• Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
• Provides consulting and training services for Hadoop users
• Staff includes several committers to Hadoop projects
Cloudera Software
• Cloudera's Distribution including Apache Hadoop (CDH)
  • A single, easy-to-install package built from the Apache Hadoop core repository
  • Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  • 100% open source
• Components
  • Apache Hadoop
  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache ZooKeeper
  • Apache Flume, Hue, Apache Oozie, Apache Sqoop, Apache Mahout
A Coherent Platform: Components of the CDH Stack
• File system mount: FUSE-DFS
• UI framework / SDK: Hue
• Workflow and scheduling: Apache Oozie
• Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
• Data integration: Apache Flume, Apache Sqoop
• Metadata: Apache Hive
• Fast read/write access: Apache HBase
• Storage and computation: HDFS, MapReduce
• Coordination: Apache ZooKeeper
Cloudera Manager, Free Edition
• End-to-end deployment and management of your CDH cluster
  • 'Zero to Hadoop in 15 minutes'
• Supports up to 50 nodes
• Free (but not open source)
Cloudera Enterprise
• A big data storage, processing and analytics platform based on CDH
• Cloudera's Distribution including Apache Hadoop (CDH)
• Cloudera Manager (full version)
  • End-to-end deployment, management, and operation of CDH
  • Provides sophisticated cluster monitoring tools not present in the free version
• Production support
  • A team of experts on call to help you meet your Service Level Agreements (SLAs)
Cloudera University
• Training for the entire Hadoop stack
  • Cloudera Developer Training for Apache Hadoop
  • Cloudera Administrator Training for Apache Hadoop
  • Cloudera Training for Apache HBase
  • Cloudera Training for Apache Hive and Pig
  • Cloudera Essentials for Apache Hadoop
  • More courses coming
• Public and private classes offered
  • Including customized on-site private classes
• Industry-recognized certifications
  • Cloudera Certified Developer for Apache Hadoop (CCDH)
  • Cloudera Certified Administrator for Apache Hadoop (CCAH)
  • Cloudera Certified Specialist in Apache HBase (CCSHB)
Professional Services
• Solutions Architects provide guidance and hands-on expertise
  • Use Case Discovery
  • New Hadoop Deployment
  • Proof of Concept
  • Production Pilot
  • Process and Team Development
  • Hadoop Deployment Certification
How Did Apache Hadoop Originate?
• Heavily influenced by Google's architecture
  • Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook and others

Timeline:
• 2002: Nutch spun off from Lucene
• 2003: Google publishes GFS paper
• 2004: Google publishes MapReduce paper
• 2005: Nutch rewritten for MapReduce
Why Do We Have So Much Data?
• And what are we supposed to do with it?
Velocity
• Why we're generating data faster than ever
  • Processes are increasingly automated
  • Systems are increasingly interconnected
  • People are increasingly interacting online
Variety
• What types of data are we producing?
  • Application logs
  • Text messages
  • Social network connections
  • Tweets
  • Photos
• Not all of this maps cleanly to the relational model
Volume
• The result of this is that every single day
  • Twitter processes 340 million messages
  • Facebook stores 2.7 billion comments and "Likes"
  • Google processes about 24 petabytes of data
• And every single minute
  • More than 200 million e-mail messages are sent
  • Foursquare processes more than 2,000 check-ins
Where Does Data Come From?
• Science
  • Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
• Industry
  • Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
• Legacy
  • Sales data, customer behavior, product databases, accounting data, etc.
• System data
  • Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters
Analyzing Data: The Challenges
• Huge volumes of data
• Mixed sources result in many different formats
  • XML, CSV, EDI
  • Log files, objects
  • SQL, text, JSON, binary
  • Etc.
What is Common Across Hadoop-able Problems?
• Nature of the data
  • Complex data
  • Multiple data sources
  • Lots of it
• Nature of the analysis
  • Batch processing
  • Parallel execution
  • Spread data over a cluster of servers and take the computation to the data
Benefits of Analyzing With Hadoop
• Previously impossible or impractical analysis becomes feasible
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
What Analysis is Possible With Hadoop?
• Text mining
• Collaborative filtering
• Index building
• Prediction models
• Graph creation and analysis
• Sentiment analysis
• Pattern recognition
• Risk assessment
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data "sandbox"
1. Modeling True Risk
Challenge:
• How much risk exposure does an organization really have with each customer?
  • Multiple sources of data, across multiple lines of business
Solution with Hadoop:
• Source and aggregate disparate data sources to build a complete data picture
  • e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
  • Sentiment analysis, graph creation, pattern recognition
Typical industry:
• Financial services (banks, insurance companies)
2. Customer Churn Analysis
Challenge:
• Why is an organization really losing customers?
  • Data on these factors comes from different sources
Solution with Hadoop:
• Rapidly build a behavioral model from disparate data sources
• Structure and analyze with Hadoop
  • Traversing
  • Graph creation
  • Pattern recognition
Typical industry:
• Telecommunications, financial services
3. Recommendation Engine / Ad Targeting
Challenge:
• Using user data to predict which products to recommend
Solution with Hadoop:
• Batch processing framework
  • Allows execution in parallel over large datasets
• Collaborative filtering
  • Collecting 'taste' information from many users
  • Utilizing that information to predict what similar users like
Typical industry:
• Ecommerce, manufacturing, retail
• Advertising
4. Point of Sale Transaction Analysis
Challenge:
• Analyzing Point of Sale (PoS) data to target promotions and manage operations
  • Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
• Batch processing framework
  • Allows execution in parallel over large datasets
• Pattern recognition
  • Optimizing over multiple data sources
  • Utilizing information to predict demand
Typical industry:
• Retail
5. Analyzing Network Data to Predict Failure
Challenge:
• Analyzing real-time data series from a network of sensors
  • Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data
Solution with Hadoop:
• Take the computation to the data
  • Expand from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
  • Discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure
Typical industry:
• Utilities, telecommunications, data centers
6. Threat Analysis / Trade Surveillance
Challenge:
• Detecting threats in the form of fraudulent activity or attacks
  • Large data volumes involved
  • Like looking for a needle in a haystack
Solution with Hadoop:
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e., threats
Typical industry:
• Security, financial services; in general: spam fighting, click fraud
7. Search Quality
Challenge:
• Providing meaningful search results in real time
Solution with Hadoop:
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
  • Browsing patterns of users performing searches in different categories
Typical industry:
• Web, ecommerce
8. Data "Sandbox"
Challenge:
• Data deluge
  • Don't know what to do with the data or what analysis to run
Solution with Hadoop:
• "Dump" all this data into an HDFS cluster
• Use Hadoop to start trying out different analyses on the data
• See patterns to derive value from the data
Typical industry:
• Common across all industries
Hadoop: How Does It Work?
• Moore's law… and not
Disk Capacity and Price
• We're generating more data than ever before
• Fortunately, the size and cost of storage has kept pace
  • Capacity has increased while price has decreased

Year | Capacity (GB) | Cost per GB (USD)
1997 | 2.1           | $157
2004 | 200           | $1.05
2012 | 3,000         | $0.05
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates haven't kept pace with capacity

Year | Capacity (GB) | Transfer Rate (MB/s) | Time to Read Full Disk
1997 | 2.1           | 16.6                 | 126 seconds
2004 | 200           | 56.5                 | 59 minutes
2012 | 3,000         | 210                  | 3 hours, 58 minutes

(Read time is simply capacity divided by transfer rate: 3,000 GB ÷ 210 MB/s ≈ 14,300 seconds, nearly four hours to read a single disk end to end.)
Architecture of a Typical HPC System
Compute nodes are connected to a dedicated storage system over a fast network. A job runs in three steps:
• Step 1: Copy input data from the storage system to the compute nodes
• Step 2: Process the data on the compute nodes
• Step 3: Copy output data back to the storage system
You Don't Just Need Speed…
• The problem is that we have way more data than code

$ du -ks code/
1,083
$ du -ks data/
854,632,947,314
You Need Speed At Scale
• In the HPC architecture above, the network between the compute nodes and the storage system becomes the bottleneck as data volumes grow
HDFS: HADOOP DISTRIBUTED FILESYSTEM
Because 10,000 hard disks are better than one
Collocated Storage and Processing
• Solution: store and process data on the same nodes
  • Data locality: "Bring the computation to the data"
  • Reduces I/O and boosts performance
• "Slave" nodes handle both storage and processing
Hard Disk Latency
• Disk seeks are expensive
• Solution: read lots of data at once to amortize the cost
(The diagram contrasted the current location of the disk head with where the needed data is stored, illustrating the cost of a seek.)
Introducing HDFS
• Hadoop Distributed File System
  • Scalable storage influenced by Google's file system paper
• It's not a general-purpose filesystem
  • HDFS is optimized for Hadoop
  • Values high throughput much more than low latency
  • It's a user-space Java process
  • Primarily accessed via command-line utilities and a Java API
HDFS is (Mostly) UNIX-like
• In many ways, HDFS is similar to a UNIX filesystem
  • Hierarchical
  • UNIX-style paths (e.g. /foo/bar/myfile.txt)
  • File ownership and permissions
• There are also some major deviations from UNIX
  • No concept of a current working directory
  • Cannot modify files once written
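To make that concrete, here is a minimal session with the standard hadoop fs command-line utility (the paths here are hypothetical):

$ hadoop fs -mkdir /user/fred/logs          # create a directory
$ hadoop fs -put app.log /user/fred/logs/   # copy a local file into HDFS
$ hadoop fs -ls /user/fred/logs             # familiar UNIX-style listing
$ hadoop fs -cat /user/fred/logs/app.log    # print a file's contents
$ hadoop fs -rm /user/fred/logs/app.log     # files can be deleted, just never edited in place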
HDFS High-Level Architecture
• HDFS follows a master-slave architecture
• There are two essential daemons in HDFS
  • Master: NameNode
    • Responsible for the namespace and metadata
    • Namespace: the file hierarchy
    • Metadata: ownership, permissions, block locations, etc.
  • Slave: DataNode
    • Responsible for storing the actual data blocks
Anatomy of a Small Hadoop Cluster
The diagram shows the HDFS-related daemons on a small cluster:
• Each "slave" node runs a DataNode daemon
• The "master" node runs the NameNode daemon
HDFS Blocks
• When a file is added to HDFS, it's split into blocks
• This is a similar concept to native filesystems
  • HDFS uses a much larger block size (64 MB) for performance
• Example: a 150 MB input file becomes
  • Block #1 (64 MB)
  • Block #2 (64 MB)
  • Block #3 (the remaining 22 MB)
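The split arithmetic is simple enough to sketch in a few lines of Python (an illustration of the math only, not HDFS code):

BLOCK_SIZE = 64  # MB, the default HDFS block size discussed above

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS-style blocks for a file."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

print split_into_blocks(150)   # prints [64, 64, 22]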
HDFS Replication
• Each block is then replicated across machines; with five DataNodes (A through E):
  • Block #1 might be replicated to nodes A, C and D
  • Block #2 might be replicated to nodes B, D and E
  • Block #3 might be replicated to nodes A, C and E
HDFS Reliability
• Replication helps achieve reliability
  • Even when a node fails, two copies of each block remain
  • These will be re-replicated to other nodes automatically
• Continuing the example: the failed node held blocks #1 and #3
  • Blocks #1 and #3 are still available on node C
  • Block #1 is still available on node D; block #3 is still available on node E
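The bookkeeping behind re-replication is easy to model. Below is a toy version of the block map from these slides (purely illustrative; in real HDFS this logic lives in the NameNode):

# Block map from the example: block -> nodes holding a replica
block_map = {
    'block1': set(['A', 'C', 'D']),
    'block2': set(['B', 'D', 'E']),
    'block3': set(['A', 'C', 'E']),
}

def fail_node(node):
    """Drop a failed node and report blocks needing re-replication."""
    for block in sorted(block_map):
        nodes = block_map[block]
        if node in nodes:
            nodes.discard(node)
            print "%s down to %d copies, on %s" % \
                  (block, len(nodes), sorted(nodes))

fail_node('A')
# block1 down to 2 copies, on ['C', 'D']
# block3 down to 2 copies, on ['C', 'E']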
DATA PROCESSING WITH MAPREDUCE
It not only works, it's functional
MapReduce High-Level Architecture
• Like HDFS, MapReduce has a master-slave architecture
• There are two daemons in "classical" MapReduce
  • Master: JobTracker
    • Responsible for dividing, scheduling and monitoring work
  • Slave: TaskTracker
    • Responsible for the actual processing
Anatomy of a Small Hadoop Cluster
The diagram shows both MapReduce and HDFS daemons:
• Each "slave" node runs a DataNode daemon and a TaskTracker daemon
• The "master" node runs the NameNode daemon and the JobTracker daemon
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
  • One function (Map) processes data
  • That output is ultimately the input to another function (Reduce)
  • Each piece is simple, but can be powerful when combined

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
   941 ERROR
 78264 INFO
  4312 WARN
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming input
• In the pipeline above, the egrep and cut stages play the role of Map
Intermediate Processing
• The Map function's output is grouped and sorted
  • This is the automatic "sort and shuffle" process in Hadoop
• In the pipeline above, the sort stage plays this role
The Reduce Function
• Operates on all records in a group
  • Often used for sum, average or other aggregate functions
• In the pipeline above, the uniq -c stage plays this role
MapReduce History
• MapReduce is not a language, it's a programming model
  • A style of processing data that you could implement in any language
• MapReduce has its roots in functional programming
  • Many languages have functions named map and reduce
  • These functions have largely the same purpose in Hadoop
• Popularized for large-scale data processing by Google
MapReduce Benefits
• Complex details are abstracted away from the developer
  • No file I/O
  • No networking code
  • No synchronization
• It's scalable because you process one record at a time
  • A record consists of a key and a corresponding value
  • We often care about only one of these
MapReduce Example in Python
• MapReduce code for Hadoop is typically written in Java
  • But it's possible to use nearly any language with Hadoop Streaming
• I'll show the log event counter using MapReduce in Python
• It's very helpful to see the data as well as the code
Job Input
• Each Mapper gets a chunk of the job's input data to process
  • This "chunk" is called an InputSplit
  • In most cases, this corresponds to a block in HDFS

2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
Python Code for Map Function
• Our map function will parse the event type
• It then outputs that event type (key) and a literal 1 (value)

#!/usr/bin/env python
import sys

# The log levels we want to count (log4j-style severities)
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Split every line (record) received on standard input into fields,
# normalizing case; if a field matches a log level, print it (and a 1)
for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        if field in levels:
            print "%s\t1" % field
Output of Map Function
• The map function produces key/value pairs as output

INFO	1
INFO	1
WARN	1
INFO	1
WARN	1
INFO	1
ERROR	1
Input to Reduce Function
• The Reducer receives a key and all values for that key
  • Keys are always passed to Reducers in sorted order
  • Although it's not obvious here, values are unordered

ERROR	1
INFO	1
INFO	1
INFO	1
INFO	1
WARN	1
WARN	1
Python Code for Reduce Function
• The Reducer extracts the key and value it was passed, then simply adds up the values for each key

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = ''
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    value = int(value)
    if key == previous_key:
        # Key unchanged: add to the running count
        sum = sum + value
    else:
        # Key changed: print the sum for the previous key
        if previous_key != '':
            print '%s\t%i' % (previous_key, sum)
        # Re-initialize loop variables for the new key
        previous_key = key
        sum = value
# Print the sum for the final key
print '%s\t%i' % (previous_key, sum)
Output of Reduce Function
• The output of this Reduce function is a sum for each level

ERROR	1
INFO	4
WARN	2
Recap of Data Flow

Map input:
2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

Reduce input (after sort and shuffle):
ERROR 1; INFO 1, 1, 1, 1; WARN 1, 1

Reduce output:
ERROR 1, INFO 4, WARN 2

You can run this entire flow yourself, as sketched below.
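Because Hadoop Streaming works over standard input and output, the whole job can be tested on a single machine, with an ordinary shell pipeline standing in for the sort and shuffle (map.py and reduce.py are the two scripts above; the file names are mine):

$ cat app.log | ./map.py | sort | ./reduce.py
ERROR	1
INFO	4
WARN	2

On a cluster, the same scripts are submitted through the Hadoop Streaming jar, roughly as follows (the jar's location varies by Hadoop version and distribution):

$ hadoop jar hadoop-streaming.jar \
    -input logs/app.log -output counts \
    -mapper map.py -reducer reduce.py \
    -file map.py -file reduce.py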
Input Splits Feed the Map Tasks
• Input for the entire job is subdivided into InputSplits
  • An InputSplit usually corresponds to a single HDFS block
  • Each of these serves as input to a single Map task
• Example: 192 MB of job input becomes three 64 MB splits, feeding Mapper #1, Mapper #2 and Mapper #3
Mappers Feed the Shuffle and Sort
• The output of all Mappers is partitioned, merged, and sorted
  • No code required: Hadoop does this automatically
• In the diagram, each of N Mappers emits a mix of (INFO, 1), (WARN, 1) and (ERROR, 1) pairs; after the shuffle, all the ERROR pairs are grouped together, all the INFO pairs together, and all the WARN pairs together
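Conceptually, the shuffle is just "sort by key, then batch equal keys together". Here is a toy single-process imitation in Python (illustrative only; real Hadoop partitions the pairs across reducers over the network):

from itertools import groupby
from operator import itemgetter

# A few (key, value) pairs as the Mappers might emit them
mapper_output = [('INFO', 1), ('WARN', 1), ('INFO', 1),
                 ('ERROR', 1), ('INFO', 1), ('WARN', 1)]

# Sort by key, then group equal keys together, as the shuffle does
for key, pairs in groupby(sorted(mapper_output), key=itemgetter(0)):
    print key, [v for k, v in pairs]
# ERROR [1]
# INFO [1, 1, 1]
# WARN [1, 1]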
Shuffle and Sort Feed the Reducers
• All values for a given key are then collapsed into a list
• The key and all its values are fed to the Reducers as input
• In the diagram: INFO with its eight 1s goes to Reducer #1, while ERROR (three 1s) and WARN (four 1s) go to Reducer #2
Each Reducer Has an Output File
• These are stored in HDFS below your output directory
  • Here, Reducer #1 produces a file containing "INFO 8"; Reducer #2 produces one containing "ERROR 3" and "WARN 4"
• Use hadoop fs -getmerge to combine them into a single local copy, as shown below
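For example, if the job wrote its part files under a directory named counts (the names here are hypothetical):

$ hadoop fs -getmerge counts/ counts-local.txt
$ cat counts-local.txt
INFO	8
ERROR	3
WARN	4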
Apache Hadoop Ecosystem: Overview
• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data
  • Still others, like Oozie, help you use Hadoop more effectively
• Most are open source Apache projects like Hadoop
  • Also like Hadoop, they have funny names
  • All of these are part of Cloudera's CDH distribution
Ecosystem: Apache Flume
• Flume ingests streaming data into HDFS as it is generated
• Sources include log files, program output, syslog, custom sources and many more
Ecosystem: Apache Sqoop
• Integrates Hadoop with any JDBC-compatible database
  • Retrieve all tables, a single table, or a portion of a table to store in HDFS
  • Can also export data from HDFS back to the database
• A typical import is sketched below
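This minimal example pulls one table into HDFS (the connection string, credentials and table name are all made up for illustration):

$ sqoop import \
    --connect jdbc:mysql://dbserver/sales \
    --username analyst -P \
    --table customers \
    --target-dir /data/customers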
Ecosystem: Apache Hive
• Hive allows you to do SQL-like queries on data in HDFS
  • It turns these queries into MapReduce jobs that run on your cluster
  • Reduces development time
  • Makes Hadoop more accessible to non-engineers

SELECT customers.id, customers.name, sum(orders.cost)
FROM customers
JOIN orders ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;
Ecosystem: Apache Pig
• Apache Pig has a similar purpose to Hive
  • It has a high-level language (Pig Latin) for data analysis
  • Scripts yield MapReduce jobs that run on your cluster
• But Pig's approach is much different from Hive's: you describe a step-by-step data flow rather than a single declarative query (see the sketch below)
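For flavor, the log-level count from earlier might look roughly like this in Pig Latin (a sketch under assumptions: the field layout matches the sample log and a space-delimited loader suffices):

logs    = LOAD 'logs/app.log' USING PigStorage(' ')
          AS (date, time, tz, level, msg);
grouped = GROUP logs BY level;
counts  = FOREACH grouped GENERATE group, COUNT(logs);
DUMP counts;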
Ecosystem: Apache HBase
• A NoSQL database built on HDFS
• Low latency and high performance for reads and writes
• Extremely scalable
  • Tables can have billions of rows
  • And potentially millions of columns
You Should Be Using CDH
• Cloudera's Distribution including Apache Hadoop (CDH)
  • The most widely used distribution of Hadoop
  • A stable, proven and supported environment you can count on
• Combines Hadoop with many important ecosystem tools
  • Such as Hive, Pig, Sqoop, Flume and many more
  • All of these are integrated and work well together
• How much does it cost?
  • It's completely free
  • Apache licensed, so it's 100% open source too
When is Hadoop (Not) a Good Choice?
• Hadoop may be a great choice when
  • You need to process non-relational (unstructured) data
  • You are processing large amounts of data
  • You can run your jobs in batch mode
• Hadoop may not be a great choice when
  • You're processing small amounts of data
  • Your algorithms require communication among nodes
  • You need low latency or transactions
• As always, use the best tool for the job
  • And know how to integrate it with other systems
Managing the Elephant in the Room: Roles
• System Administrators
• Developers
• Analysts
• Data Stewards
System Administrators
• Required skills:
  • Strong Linux administration skills
  • Networking knowledge
  • Understanding of hardware
• Job responsibilities:
  • Install, configure and upgrade Hadoop software
  • Manage hardware components
  • Monitor the cluster
  • Integrate with other systems (e.g., Flume and Sqoop)
Developers
• Required skills:
  • Strong Java or scripting capabilities
  • Understanding of MapReduce and algorithms
• Job responsibilities:
  • Write, package and deploy MapReduce programs
  • Optimize MapReduce jobs and Hive/Pig programs
Data Analyst / Business Analyst
• Required skills:
  • SQL
  • Understanding of data analytics and data mining
• Job responsibilities:
  • Extract intelligence from the data
  • Write Hive and/or Pig programs
Data Steward
• Required skills:
  • Data modeling and ETL
  • Scripting skills
• Job responsibilities:
  • Cataloging the data (analogous to a librarian for books)
  • Managing data lifecycle and retention
  • Data quality control with SLAs
Combining Roles
• System Administrator + Data Steward is analogous to a DBA
• Required skills:
  • Data modeling and ETL
  • Scripting skills
  • Strong Linux administration skills
• Job responsibilities:
  • Manage data lifecycle and retention
  • Data quality control with SLAs
  • Install, configure and upgrade Hadoop software
  • Manage hardware components
  • Monitor the cluster
  • Integrate with other systems (e.g., Flume and Sqoop)
Conclusion
• Thanks for your time!
• Questions?