An Introduction to Hadoop
Mark Fei, Cloudera
Strata + Hadoop World 2012, New York City, October 23, 2012
Who Am I?
Mark Fei, Cloudera (Durango, Colorado)
Current:
• Senior Instructor at Cloudera
Past:
• Professional Services – Education, VMware
• Senior Member of Technical Staff, Hill Associates
• Sales Engineer, Nortel Networks
• Systems Programmer, large bank
• Banking applications software developer
What's Ahead?
• A solid introduction to Apache Hadoop
  • What it is
  • Why it's relevant
  • How it works
  • The ecosystem
• No prior experience needed
• Feel free to ask questions
What is Apache Hadoop?
• Scalable data storage and processing
  • Open source Apache project
  • Harnesses the power of commodity servers
  • Distributed and fault-tolerant
• "Core" Hadoop consists of two main parts
  • HDFS (storage)
  • MapReduce (processing)
• A large ecosystem
Who Uses Hadoop?
• Vendors across the stack integrate with Hadoop:
  • BI / Analytics
  • ETL
  • Database
  • OS / Cloud / System Mgmt.
  • Hardware
About Cloudera
• Cloudera is "The commercial Hadoop company"
• Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
• Provides consulting and training services for Hadoop users
• Staff includes several committers to Hadoop projects
Cloudera Software
• Cloudera's Distribution including Apache Hadoop (CDH)
  • A single, easy-to-install package built from the Apache Hadoop core repository
  • Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  • 100% open source
• Components
  • Apache Hadoop
  • Apache Hive
  • Apache Pig
  • Apache HBase
  • Apache ZooKeeper
  • Apache Flume, Hue, Apache Oozie, Apache Sqoop, Apache Mahout
A Coherent Platform: Components of the CDH Stack
• File system mount: FUSE-DFS
• UI framework / SDK: Hue
• Workflow and scheduling: Apache Oozie
• Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
• Data integration: Apache Flume, Apache Sqoop
• Metadata: Apache Hive
• Fast read/write access: Apache HBase
• Storage and computation: HDFS, MapReduce
• Coordination: Apache ZooKeeper
Cloudera Manager, Free Edition
• End-to-end deployment and management of your CDH cluster
  • 'Zero to Hadoop in 15 minutes'
• Supports up to 50 nodes
• Free (but not open source)
Cloudera Enterprise
• A big data storage, processing and analytics platform based on CDH
• Cloudera's Distribution including Apache Hadoop (CDH)
• Cloudera Manager (full version)
  • End-to-end deployment, management, and operation of CDH
  • Provides sophisticated cluster monitoring tools not present in the free version
• Production support
  • A team of experts on call to help you meet your Service Level Agreements (SLAs)
Cloudera University
• Training for the entire Hadoop stack
  • Cloudera Developer Training for Apache Hadoop
  • Cloudera Administrator Training for Apache Hadoop
  • Cloudera Training for Apache HBase
  • Cloudera Training for Apache Hive and Pig
  • Cloudera Essentials for Apache Hadoop
  • More courses coming
• Public and private classes offered
  • Including customized on-site private classes
• Industry-recognized certifications
  • Cloudera Certified Developer for Apache Hadoop (CCDH)
  • Cloudera Certified Administrator for Apache Hadoop (CCAH)
  • Cloudera Certified Specialist in Apache HBase (CCSHB)
Professional Services
• Solutions Architects provide guidance and hands-on expertise
  • Use Case Discovery
  • New Hadoop Deployment
  • Proof of Concept
  • Production Pilot
  • Process and Team Development
  • Hadoop Deployment Certification
How Did Apache Hadoop Originate?
• Heavily influenced by Google's architecture
  • Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook and others

Timeline:
• 2002: Nutch spun off from Lucene
• 2003: Google publishes GFS paper
• 2004: Google publishes MapReduce paper
• 2005: Nutch rewritten for MapReduce
Why Do We Have So Much Data?
• And what are we supposed to do with it?
Velocity
• Why we're generating data faster than ever
  • Processes are increasingly automated
  • Systems are increasingly interconnected
  • People are increasingly interacting online
Variety
• What types of data are we producing?
  • Application logs
  • Text messages
  • Social network connections
  • Tweets
  • Photos
• Not all of this maps cleanly to the relational model
Volume
• The result of this is that every single day
  • Twitter processes 340 million messages
  • Facebook stores 2.7 billion comments and "Likes"
  • Google processes about 24 petabytes of data
• And every single minute
  • More than 200 million e-mail messages are sent
  • Foursquare processes more than 2,000 check-ins
Where Does Data Come From?
• Science
  • Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
• Industry
  • Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
• Legacy
  • Sales data, customer behavior, product databases, accounting data, etc.
• System data
  • Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters
Analyzing Data: The Challenges
• Huge volumes of data
• Mixed sources result in many different formats
  • XML, CSV, EDI
  • Log files, objects
  • SQL, text, JSON, binary
  • Etc.
What is Common Across Hadoop-able Problems?
• Nature of the data
  • Complex data
  • Multiple data sources
  • Lots of it
• Nature of the analysis
  • Batch processing
  • Parallel execution
  • Spread data over a cluster of servers and take the computation to the data
Benefits of Analyzing With Hadoop
• Previously impossible or impractical analysis becomes feasible
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
What Analysis is Possible With Hadoop?
• Text mining
• Collaborative filtering
• Index building
• Prediction models
• Graph creation and analysis
• Sentiment analysis
• Pattern recognition
• Risk assessment
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data "sandbox"
1. Modeling True Risk
Challenge:
• How much risk exposure does an organization really have with each customer?
  • Multiple sources of data, across multiple lines of business
Solution with Hadoop:
• Source and aggregate disparate data sources to build a complete data picture
  • e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
  • Sentiment analysis, graph creation, pattern recognition
Typical industry:
• Financial services (banks, insurance companies)
2. Customer Churn Analysis
Challenge:
• Why is an organization really losing customers?
  • Data on these factors comes from different sources
Solution with Hadoop:
• Rapidly build a behavioral model from disparate data sources
• Structure and analyze with Hadoop
  • Traversing
  • Graph creation
  • Pattern recognition
Typical industry:
• Telecommunications, financial services
3. Recommendation Engine / Ad Targeting
Challenge:
• Using user data to predict which products to recommend
Solution with Hadoop:
• Batch processing framework
  • Allows execution in parallel over large datasets
• Collaborative filtering
  • Collecting 'taste' information from many users
  • Utilizing that information to predict what similar users like
Typical industry:
• Ecommerce, manufacturing, retail
• Advertising
4. Point of Sale Transaction Analysis
Challenge:
• Analyzing Point of Sale (PoS) data to target promotions and manage operations
  • Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
• Batch processing framework
  • Allows execution in parallel over large datasets
• Pattern recognition
  • Optimizing over multiple data sources
  • Utilizing information to predict demand
Typical industry:
• Retail
5. Analyzing Network Data to Predict Failure
Challenge:
• Analyzing real-time data series from a network of sensors
  • Calculating average frequency over time is extremely tedious because of the need to analyze terabytes of data
Solution with Hadoop:
• Take the computation to the data
  • Expand from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
  • Discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure
Typical industry:
• Utilities, telecommunications, data centers
6. Threat Analysis / Trade Surveillance
Challenge:
• Detecting threats in the form of fraudulent activity or attacks
  • Large data volumes involved
  • Like looking for a needle in a haystack
Solution with Hadoop:
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e., threats
Typical industry:
• Security, financial services; in general: spam fighting, click fraud
7. Search Quality
Challenge:
• Providing meaningful search results in real time
Solution with Hadoop:
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
  • Browsing patterns of users performing searches in different categories
Typical industry:
• Web, ecommerce
8. Data "Sandbox"
Challenge:
• Data deluge
  • Don't know what to do with the data or what analysis to run
Solution with Hadoop:
• "Dump" all this data into an HDFS cluster
• Use Hadoop to start trying out different analyses on the data
• See patterns to derive value from the data
Typical industry:
• Common across all industries
Hadoop: How Does It Work?
• Moore's law… and not
Disk Capacity and Price
• We're generating more data than ever before
• Fortunately, the size and cost of storage has kept pace
  • Capacity has increased while price has decreased

Year | Capacity (GB) | Cost per GB (USD)
1997 | 2.1           | $157
2004 | 200           | $1.05
2012 | 3,000         | $0.05
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates haven't kept pace with capacity

Year | Capacity (GB) | Transfer Rate (MB/s) | Time to Read Full Disk
1997 | 2.1           | 16.6                 | 126 seconds
2004 | 200           | 56.5                 | 59 minutes
2012 | 3,000         | 210                  | 3 hours, 58 minutes

(Read time is simply capacity divided by transfer rate: 3,000 GB ÷ 210 MB/s ≈ 14,300 seconds, nearly four hours to read a single disk end to end.)
Architecture of a Typical HPC System
Compute nodes are connected to a dedicated storage system over a fast network. A job runs in three steps:
• Step 1: Copy input data from the storage system to the compute nodes
• Step 2: Process the data on the compute nodes
• Step 3: Copy output data back to the storage system
You Don't Just Need Speed…
• The problem is that we have way more data than code

$ du -ks code/
1,083
$ du -ks data/
854,632,947,314
You Need Speed At Scale
• In the HPC architecture above, the network between the compute nodes and the storage system becomes the bottleneck as data volumes grow
HDFS: HADOOP DISTRIBUTED FILESYSTEM
Because 10,000 hard disks are better than one
Collocated Storage and Processing
• Solution: store and process data on the same nodes
  • Data locality: "Bring the computation to the data"
  • Reduces I/O and boosts performance
• "Slave" nodes handle both storage and processing
Hard Disk Latency
• Disk seeks are expensive
• Solution: read lots of data at once to amortize the cost
(The diagram contrasted the current location of the disk head with where the needed data is stored, illustrating the cost of a seek.)
Introducing HDFS
• Hadoop Distributed File System
  • Scalable storage influenced by Google's file system paper
• It's not a general-purpose filesystem
  • HDFS is optimized for Hadoop
  • Values high throughput much more than low latency
  • It's a user-space Java process
  • Primarily accessed via command-line utilities and a Java API
HDFS is (Mostly) UNIX-like
• In many ways, HDFS is similar to a UNIX filesystem
  • Hierarchical
  • UNIX-style paths (e.g. /foo/bar/myfile.txt)
  • File ownership and permissions
• There are also some major deviations from UNIX
  • No concept of a current working directory
  • Cannot modify files once written
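To make that concrete, here is a minimal session with the standard hadoop fs command-line utility (the paths here are hypothetical):

$ hadoop fs -mkdir /user/fred/logs          # create a directory
$ hadoop fs -put app.log /user/fred/logs/   # copy a local file into HDFS
$ hadoop fs -ls /user/fred/logs             # familiar UNIX-style listing
$ hadoop fs -cat /user/fred/logs/app.log    # print a file's contents
$ hadoop fs -rm /user/fred/logs/app.log     # files can be deleted, just never edited in place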
HDFS High-Level Architecture
• HDFS follows a master-slave architecture
• There are two essential daemons in HDFS
  • Master: NameNode
    • Responsible for the namespace and metadata
    • Namespace: the file hierarchy
    • Metadata: ownership, permissions, block locations, etc.
  • Slave: DataNode
    • Responsible for storing the actual data blocks
Anatomy of a Small Hadoop Cluster
The diagram shows the HDFS-related daemons on a small cluster:
• Each "slave" node runs a DataNode daemon
• The "master" node runs the NameNode daemon
HDFS Blocks
• When a file is added to HDFS, it's split into blocks
• This is a similar concept to native filesystems
  • HDFS uses a much larger block size (64 MB) for performance
• Example: a 150 MB input file becomes
  • Block #1 (64 MB)
  • Block #2 (64 MB)
  • Block #3 (the remaining 22 MB)
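The split arithmetic is simple enough to sketch in a few lines of Python (an illustration of the math only, not HDFS code):

BLOCK_SIZE = 64  # MB, the default HDFS block size discussed above

def split_into_blocks(file_size_mb):
    """Return the sizes of the HDFS-style blocks for a file."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

print split_into_blocks(150)   # prints [64, 64, 22]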
HDFS Replication
• Each block is then replicated across machines; with five DataNodes (A through E):
  • Block #1 might be replicated to nodes A, C and D
  • Block #2 might be replicated to nodes B, D and E
  • Block #3 might be replicated to nodes A, C and E
HDFS Reliability
• Replication helps achieve reliability
  • Even when a node fails, two copies of each block remain
  • These will be re-replicated to other nodes automatically
• Continuing the example: the failed node held blocks #1 and #3
  • Blocks #1 and #3 are still available on node C
  • Block #1 is still available on node D; block #3 is still available on node E
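The bookkeeping behind re-replication is easy to model. Below is a toy version of the block map from these slides (purely illustrative; in real HDFS this logic lives in the NameNode):

# Block map from the example: block -> nodes holding a replica
block_map = {
    'block1': set(['A', 'C', 'D']),
    'block2': set(['B', 'D', 'E']),
    'block3': set(['A', 'C', 'E']),
}

def fail_node(node):
    """Drop a failed node and report blocks needing re-replication."""
    for block in sorted(block_map):
        nodes = block_map[block]
        if node in nodes:
            nodes.discard(node)
            print "%s down to %d copies, on %s" % \
                  (block, len(nodes), sorted(nodes))

fail_node('A')
# block1 down to 2 copies, on ['C', 'D']
# block3 down to 2 copies, on ['C', 'E']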
DATA PROCESSING WITH MAPREDUCE
It not only works, it's functional
MapReduce High-Level Architecture
• Like HDFS, MapReduce has a master-slave architecture
• There are two daemons in "classical" MapReduce
  • Master: JobTracker
    • Responsible for dividing, scheduling and monitoring work
  • Slave: TaskTracker
    • Responsible for the actual processing
Anatomy of a Small Hadoop Cluster
The diagram shows both MapReduce and HDFS daemons:
• Each "slave" node runs a DataNode daemon and a TaskTracker daemon
• The "master" node runs the NameNode daemon and the JobTracker daemon
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
  • One function (Map) processes data
  • That output is ultimately the input to another function (Reduce)
  • Each piece is simple, but can be powerful when combined

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
   941 ERROR
 78264 INFO
  4312 WARN
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming input
• In the pipeline above, the egrep and cut stages play the role of Map
Intermediate Processing
• The Map function's output is grouped and sorted
  • This is the automatic "sort and shuffle" process in Hadoop
• In the pipeline above, the sort stage plays this role
The Reduce Function
• Operates on all records in a group
  • Often used for sum, average or other aggregate functions
• In the pipeline above, the uniq -c stage plays this role
MapReduce History
• MapReduce is not a language, it's a programming model
  • A style of processing data that you could implement in any language
• MapReduce has its roots in functional programming
  • Many languages have functions named map and reduce
  • These functions have largely the same purpose in Hadoop
• Popularized for large-scale data processing by Google
MapReduce Benefits
• Complex details are abstracted away from the developer
  • No file I/O
  • No networking code
  • No synchronization
• It's scalable because you process one record at a time
  • A record consists of a key and a corresponding value
  • We often care about only one of these
MapReduce Example in Python
• MapReduce code for Hadoop is typically written in Java
  • But it's possible to use nearly any language with Hadoop Streaming
• I'll show the log event counter using MapReduce in Python
• It's very helpful to see the data as well as the code
Job Input
• Each Mapper gets a chunk of the job's input data to process
  • This "chunk" is called an InputSplit
  • In most cases, this corresponds to a block in HDFS

2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
Python Code for Map Function
• Our map function will parse the event type
• It then outputs that event type (key) and a literal 1 (value)

#!/usr/bin/env python
import sys

# The log levels we want to count (log4j-style severities)
levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Split every line (record) received on standard input into fields,
# normalizing case; if a field matches a log level, print it (and a 1)
for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        if field in levels:
            print "%s\t1" % field
Output of Map Function
• The map function produces key/value pairs as output

INFO	1
INFO	1
WARN	1
INFO	1
WARN	1
INFO	1
ERROR	1
Input to Reduce Function
• The Reducer receives a key and all values for that key
  • Keys are always passed to Reducers in sorted order
  • Although it's not obvious here, values are unordered

ERROR	1
INFO	1
INFO	1
INFO	1
INFO	1
WARN	1
WARN	1
Python Code for Reduce Function
• The Reducer extracts the key and value it was passed, then simply adds up the values for each key

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = ''
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    value = int(value)
    if key == previous_key:
        # Key unchanged: add to the running count
        sum = sum + value
    else:
        # Key changed: print the sum for the previous key
        if previous_key != '':
            print '%s\t%i' % (previous_key, sum)
        # Re-initialize loop variables for the new key
        previous_key = key
        sum = value
# Print the sum for the final key
print '%s\t%i' % (previous_key, sum)
Output of Reduce Function
• The output of this Reduce function is a sum for each level

ERROR	1
INFO	4
WARN	2
Recap of Data Flow

Map input:
2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

Reduce input (after sort and shuffle):
ERROR 1; INFO 1, 1, 1, 1; WARN 1, 1

Reduce output:
ERROR 1, INFO 4, WARN 2

You can run this entire flow yourself, as sketched below.
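Because Hadoop Streaming works over standard input and output, the whole job can be tested on a single machine, with an ordinary shell pipeline standing in for the sort and shuffle (map.py and reduce.py are the two scripts above; the file names are mine):

$ cat app.log | ./map.py | sort | ./reduce.py
ERROR	1
INFO	4
WARN	2

On a cluster, the same scripts are submitted through the Hadoop Streaming jar, roughly as follows (the jar's location varies by Hadoop version and distribution):

$ hadoop jar hadoop-streaming.jar \
    -input logs/app.log -output counts \
    -mapper map.py -reducer reduce.py \
    -file map.py -file reduce.py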
Input Splits Feed the Map Tasks
• Input for the entire job is subdivided into InputSplits
  • An InputSplit usually corresponds to a single HDFS block
  • Each of these serves as input to a single Map task
• Example: 192 MB of job input becomes three 64 MB splits, feeding Mapper #1, Mapper #2 and Mapper #3
Mappers Feed the Shuffle and Sort
• The output of all Mappers is partitioned, merged, and sorted
  • No code required: Hadoop does this automatically
• In the diagram, each of N Mappers emits a mix of (INFO, 1), (WARN, 1) and (ERROR, 1) pairs; after the shuffle, all the ERROR pairs are grouped together, all the INFO pairs together, and all the WARN pairs together
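Conceptually, the shuffle is just "sort by key, then batch equal keys together". Here is a toy single-process imitation in Python (illustrative only; real Hadoop partitions the pairs across reducers over the network):

from itertools import groupby
from operator import itemgetter

# A few (key, value) pairs as the Mappers might emit them
mapper_output = [('INFO', 1), ('WARN', 1), ('INFO', 1),
                 ('ERROR', 1), ('INFO', 1), ('WARN', 1)]

# Sort by key, then group equal keys together, as the shuffle does
for key, pairs in groupby(sorted(mapper_output), key=itemgetter(0)):
    print key, [v for k, v in pairs]
# ERROR [1]
# INFO [1, 1, 1]
# WARN [1, 1]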
Shuffle and Sort Feed the Reducers
• All values for a given key are then collapsed into a list
• The key and all its values are fed to the Reducers as input
• In the diagram: INFO with its eight 1s goes to Reducer #1, while ERROR (three 1s) and WARN (four 1s) go to Reducer #2
Each Reducer Has an Output File
• These are stored in HDFS below your output directory
  • Here, Reducer #1 produces a file containing "INFO 8"; Reducer #2 produces one containing "ERROR 3" and "WARN 4"
• Use hadoop fs -getmerge to combine them into a single local copy, as shown below
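For example, if the job wrote its part files under a directory named counts (the names here are hypothetical):

$ hadoop fs -getmerge counts/ counts-local.txt
$ cat counts-local.txt
INFO	8
ERROR	3
WARN	4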
Apache Hadoop Ecosystem: Overview
• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data
  • Still others, like Oozie, help you use Hadoop more effectively
• Most are open source Apache projects like Hadoop
  • Also like Hadoop, they have funny names
  • All of these are part of Cloudera's CDH distribution
Ecosystem: Apache Flume
• Flume ingests streaming data into HDFS as it is generated
• Sources include log files, program output, syslog, custom sources and many more
Ecosystem: Apache Sqoop
• Integrates Hadoop with any JDBC-compatible database
  • Retrieve all tables, a single table, or a portion of a table to store in HDFS
  • Can also export data from HDFS back to the database
• A typical import is sketched below
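This minimal example pulls one table into HDFS (the connection string, credentials and table name are all made up for illustration):

$ sqoop import \
    --connect jdbc:mysql://dbserver/sales \
    --username analyst -P \
    --table customers \
    --target-dir /data/customers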
Ecosystem: Apache Hive
• Hive allows you to do SQL-like queries on data in HDFS
  • It turns these queries into MapReduce jobs that run on your cluster
  • Reduces development time
  • Makes Hadoop more accessible to non-engineers

SELECT customers.id, customers.name, sum(orders.cost)
FROM customers
JOIN orders ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;
Ecosystem: Apache Pig
• Apache Pig has a similar purpose to Hive
  • It has a high-level language (Pig Latin) for data analysis
  • Scripts yield MapReduce jobs that run on your cluster
• But Pig's approach is much different from Hive's: you describe a step-by-step data flow rather than a single declarative query (see the sketch below)
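For flavor, the log-level count from earlier might look roughly like this in Pig Latin (a sketch under assumptions: the field layout matches the sample log and a space-delimited loader suffices):

logs    = LOAD 'logs/app.log' USING PigStorage(' ')
          AS (date, time, tz, level, msg);
grouped = GROUP logs BY level;
counts  = FOREACH grouped GENERATE group, COUNT(logs);
DUMP counts;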
Ecosystem: Apache HBase
• A NoSQL database built on HDFS
• Low latency and high performance for reads and writes
• Extremely scalable
  • Tables can have billions of rows
  • And potentially millions of columns
You Should Be Using CDH
• Cloudera's Distribution including Apache Hadoop (CDH)
  • The most widely used distribution of Hadoop
  • A stable, proven and supported environment you can count on
• Combines Hadoop with many important ecosystem tools
  • Such as Hive, Pig, Sqoop, Flume and many more
  • All of these are integrated and work well together
• How much does it cost?
  • It's completely free
  • Apache licensed, so it's 100% open source too
When is Hadoop (Not) a Good Choice?
• Hadoop may be a great choice when
  • You need to process non-relational (unstructured) data
  • You are processing large amounts of data
  • You can run your jobs in batch mode
• Hadoop may not be a great choice when
  • You're processing small amounts of data
  • Your algorithms require communication among nodes
  • You need low latency or transactions
• As always, use the best tool for the job
  • And know how to integrate it with other systems
Managing the Elephant in the Room: Roles
• System Administrators
• Developers
• Analysts
• Data Stewards
System Administrators
• Required skills:
  • Strong Linux administration skills
  • Networking knowledge
  • Understanding of hardware
• Job responsibilities:
  • Install, configure and upgrade Hadoop software
  • Manage hardware components
  • Monitor the cluster
  • Integrate with other systems (e.g., Flume and Sqoop)
Developers
• Required skills:
  • Strong Java or scripting capabilities
  • Understanding of MapReduce and algorithms
• Job responsibilities:
  • Write, package and deploy MapReduce programs
  • Optimize MapReduce jobs and Hive/Pig programs
Data Analyst / Business Analyst
• Required skills:
  • SQL
  • Understanding of data analytics and data mining
• Job responsibilities:
  • Extract intelligence from the data
  • Write Hive and/or Pig programs
Data Steward
• Required skills:
  • Data modeling and ETL
  • Scripting skills
• Job responsibilities:
  • Cataloging the data (analogous to a librarian for books)
  • Managing data lifecycle and retention
  • Data quality control with SLAs
Combining Roles
• System Administrator + Data Steward is analogous to a DBA
• Required skills:
  • Data modeling and ETL
  • Scripting skills
  • Strong Linux administration skills
• Job responsibilities:
  • Manage data lifecycle and retention
  • Data quality control with SLAs
  • Install, configure and upgrade Hadoop software
  • Manage hardware components
  • Monitor the cluster
  • Integrate with other systems (e.g., Flume and Sqoop)
Conclusion
• Thanks for your time!
• Questions?