HADOOP A Seminar Report
Submitted to MR. RAMESH KUMAR
Submitted by AVANTIKA BISHT UNIVERSITY ROLL NO:150216
in partial fulfillment for the award of the degree of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING at GOVIND BALLABH PANT INSTITUTE OF ENGINEERING AND TECHNOLOGY Department of Computer Science Pauri Garhwal FEBRUARY 2018
ABSTRACT
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop-framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The main reason for its development was the advancement of big data. Big data is a term used to describe the voluminous amount of data that would take too much time and cost too much to load into a relational database for analysis. In a nutshell, this is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS, and the analysis by MapReduce. The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific and clickstream data. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later. Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently, uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.
Contents
1. Reasons for Hadoop
   What Is Big Data
   Big Data Challenges
   Google's Solution
   Hadoop: Big Data Solution
2. Introduction to Hadoop
   History
   Introduction
   Features of Hadoop
   Hadoop Core Components
   How Does Hadoop Work
3. Hadoop Distributed File System
   Features of HDFS
   Description of HDFS
   Goals of HDFS
   Design of HDFS
4. MapReduce
   The Algorithm
5. Hadoop Ecosystem
6. Advantages of Hadoop
7. Applications of Hadoop
   When Not to Use Hadoop
   Disadvantages of Hadoop
8. Future Perspective
9. Research Papers
10. Hadoop Research Topics
11. Case Study
12. References
List of Figures
Figure 1.1: Growth of big data
Figure 1.2: Solving the problem of big data with Hadoop
Figure 2.1: Hadoop features
Figure 2.2: Hadoop features
Figure 2.3: Hadoop parts
Figure 3.1: HDFS architecture
Figure 4.1: MapReduce
Figure 5.1: Hadoop ecosystem
ACKNOWLEDGEMENT
I would like to express my gratitude to Lord Almighty, the most Beneficent and the most Merciful, for the completion of this seminar report. I wish to thank my parents for their continuing support and encouragement. I also wish to thank them for providing me with the opportunity to reach this far in my studies. I would particularly like to thank our supervisor, Mr. Ramesh Kumar, for his patience, support and encouragement throughout the completion of this seminar report and for having faith in us. Last but not least, I am greatly indebted to all other persons who directly or indirectly helped me during this work.
REASONS FOR HADOOP Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you piled up this data in the form of disks, it might fill an entire football field. The same amount was being created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously, and it is this growth that led to big data.
1. What is Big Data? Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it involves many areas of business and technology. Thus Big Data includes huge volume, high velocity, and an extensible variety of data.
Figure 1.1: Growth of big data Source: www.edureka.co/blog/hadooptutorial
2. Big Data Challenges The major challenges associated with big data are as follows:
1. Capturing data
2. Storage
3. Searching
4. Sharing
5. Transfer
3. Google's Solution Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns them to many computers, and collects the results from them, which, when integrated, form the result dataset.
Figure 1.2: Solving problem of big data by Hadoop Source: www.edureka.co/blog/hadooptutorial
4. Hadoop: Big Data Solution Using the solution provided by Google, Doug Cutting and his team developed an open source project called HADOOP. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data. To solve the storage and processing issues, two core components were created in Hadoop: HDFS and YARN. HDFS solves the storage issue as it stores the data in a distributed fashion and is easily scalable, and YARN solves the processing issue by reducing the processing time drastically.
INTRODUCTION TO HADOOP 1. History According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the "Google File System" paper that was published in October 2003. This paper spawned another one from Google – "MapReduce: Simplified Data Processing on Large Clusters". Development started on the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.
2. Introduction Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license. Hadoop was developed based on the MapReduce paper published by Google, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is one of the top-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Figure 2.1: Hadoop Features Source: www.edureka.co/blog/hadooptutorial
3. Features of Hadoop Reliability: When machines are working in tandem, if one of the machines fails, another machine will take over its responsibility and work in a reliable and fault-tolerant fashion. The Hadoop infrastructure has inbuilt fault-tolerance features and hence Hadoop is highly reliable. Economical: Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all your DataNodes can have normal configurations like 8-16 GB of RAM with 5-10 TB of hard disk and Xeon processors, whereas hardware-based RAID with Oracle for the same purpose would cost at least five times more. So, the cost of ownership of a Hadoop-based project is quite low. It is easier to maintain a Hadoop environment, and it is economical as well. Also, Hadoop is open source software and hence there is no licensing cost. Scalability: Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you are installing Hadoop on a cloud, you don't need to worry about the scalability factor because you can go ahead, procure more hardware and expand your setup within minutes whenever required. Flexibility: Hadoop is very flexible in terms of its ability to deal with all kinds of data. Data can be of any kind, and Hadoop can store and process it all, whether it is structured, semi-structured or unstructured. These four characteristics make Hadoop a front-runner as a solution to Big Data challenges.
4. Hadoop Core Components One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, which is the scalable storage unit of Hadoop, whereas YARN is used to process the data stored in HDFS in a distributed and parallel fashion.
Figure 2.2: Hadoop Feature Source: www.edureka.co/blog/hadooptutorial
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications; and
Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
5. How Does Hadoop Work? It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivational factor behind using Hadoop is that it runs across clusters of low-cost machines. Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 64 MB or 128 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Figure 2.3: hadoop parts Source: www.edureka.co/blog/hadooptutorial Stage 1
A user/application can submit a job to Hadoop (via a Hadoop job client) for the required process by specifying the following items: 1. The location of the input and output files in the distributed file system. 2. The Java classes, in the form of a JAR file, containing the implementation of the map and reduce functions. 3. The job configuration, set through different parameters specific to the job. A minimal driver sketch illustrating these three items is given below.
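The sketch below is only an illustration, not part of the report's material: a hypothetical WordCountDriver class that specifies the input/output locations, the JAR containing the job's classes, and the job configuration through the standard org.apache.hadoop.mapreduce.Job API, using the word-count mapper and reducer helpers that ship with the Hadoop MapReduce library. The class name and command-line paths are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // item 3: job configuration parameters
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);               // item 2: JAR containing the job's classes
        job.setMapperClass(TokenCounterMapper.class);           // library mapper that emits (word, 1)
        job.setReducerClass(IntSumReducer.class);               // library reducer that sums the 1s per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // item 1: input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // item 1: output location in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait for completion
    }
}

Such a driver would typically be packaged into a JAR and submitted with the hadoop jar command, passing the input and output paths as arguments.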
Stage 2
The Hadoop job client then submits the job (JAR/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client. Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation and output of the reduce function is stored into the output files on the file system.
Hadoop Distributed File System (HDFS) The Hadoop Distributed File System was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware. HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
1. Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the NameNode and DataNode help users to easily check the status of the cluster.
It provides streaming access to file system data. HDFS provides file permissions and authentication. A short Java sketch of interacting with HDFS programmatically is given after Figure 3.1 below.
Figure 3.1: HDFS architecture Source: www.edureka.co/blog/hadooptutorial
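The following is a small, hypothetical Java sketch (not from the report; the directory and file names are made up) of interacting with HDFS programmatically through the org.apache.hadoop.fs.FileSystem API, complementing the command interface mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                     // handle to the configured file system
        fs.mkdirs(new Path("/user/demo"));                        // create a directory in HDFS
        fs.copyFromLocalFile(new Path("data.txt"),                // upload a local file into HDFS
                             new Path("/user/demo/data.txt"));
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + " " + status.getLen());  // print path and size
        }
        fs.close();
    }
}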
2. Description of Features
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to the file system metadata
If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live
It keeps a record of all the blocks in the HDFS and DataNode in which they are stored
DataNode
It is the slave daemon which runs on each slave machine
The actual data is stored on DataNodes
It is responsible for serving read and write requests from the clients
It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode
It sends heartbeats to the NameNode periodically to report the overall health of HDFS, by default, this frequency is set to 3 seconds
Block Generally, the user data is stored in the files of HDFS. A file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration. In HDFS, the NameNode is the master node and DataNodes are slaves. The NameNode contains the metadata about the data stored in the DataNodes, such as which data block is stored in which DataNode and where the replications of the data block are kept. The actual data is stored in the DataNodes. Note also that the data blocks present in the DataNodes are replicated; by default, the replication factor is 3. Since we are using commodity hardware and the failure rate of this hardware is quite high, if one of the DataNodes fails, HDFS will still have a copy of those lost data blocks. That is the reason we replicate the data blocks. You can configure the replication factor based on your requirements, as in the small sketch below.
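As a rough illustration (again an assumption-based sketch, not code from the report), a client can override the replication factor and block size for the files it writes through the standard dfs.replication and dfs.blocksize configuration properties; the path and values used here are examples only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                        // keep 3 copies of every block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);       // 128 MB blocks
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/big.dat"))) {
            out.writeBytes("sample payload\n");                  // written data is split into blocks and replicated
        }
        fs.close();
    }
}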
3. Goals of HDFS
[1] Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
[2] Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
[3] Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
4. The Design of HDFS HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let's examine this statement in more detail: Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data. Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
MapReduce MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model. MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment. MapReduce consists of two distinct tasks - Map and Reduce.
As the name MapReduce suggests, reducer phase takes place after mapper phase has been completed.
So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
The output of a Mapper or map job (key-value pairs) is input to the Reducer.
The reducer receives the key-value pair from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into a smaller set of tuples or key-value pairs which is the final output.
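To make these map and reduce phases concrete, here is a minimal word-count sketch in Java, the standard introductory MapReduce example (the class names are illustrative, not code from the report): the mapper emits a (word, 1) pair for every word in an input line, and the reducer sums the counts it receives for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {    // the mapper sees one input line at a time
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);                        // intermediate (key, value) pair
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {                       // all values for one key arrive together
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));                // final (word, total) pair
    }
}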
1. The Algorithm Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes with data on local disks, which reduces the network traffic. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
Figure 4.1: MapReduce Source: www.edureka.co/blog/hadooptutorial
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
In a MapReduce program, Map() and Reduce() are two functions.
The Map function performs actions like filtering, grouping and sorting.
The Reduce function aggregates and summarizes the results produced by the Map function.
The result generated by the Map function is a key-value pair (K, V) which acts as the input for the Reduce function.
HADOOP ECOSYSTEM
Hadoop Ecosystem is neither a programming language nor a service, it is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it.
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory Data Processing
PIG, HIVE-> Data Processing Services using Query (SQL-like)
HBase -> NoSQL Database
Mahout, Spark MLlib -> Machine Learning
Apache Drill -> SQL on Hadoop
Zookeeper -> Managing Cluster
Oozie -> Job Scheduling
Flume, Sqoop -> Data Ingesting Services
Solr & Lucene -> Searching & Indexing
Ambari -> Provision, Monitor and Maintain cluster
Figure 5.1: Hadoop ecosystem Source: www.edureka.co/blog/hadooptutorial
YARN Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by allocating resources and scheduling tasks.
It has two major components, i.e. ResourceManager and NodeManager.
ResourceManager
It is a cluster level (one for each cluster) component and runs on the master machine
It manages resources and schedules applications running on top of YARN
It has two components: Scheduler & ApplicationManager
The Scheduler is responsible for allocating resources to the various running applications
The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
It keeps track of the heartbeats from the NodeManager
NodeManager
It is a node level component (one on each node) and runs on each slave machine
It is responsible for managing containers and monitoring resource utilization in each container
It also keeps track of node health and log management
It continuously communicates with the ResourceManager to remain up to date
Apache Pig
1. Pig is made up of two pieces:
The language used to express data flows, called Pig Latin.
The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
It supports an SQL-like command structure. It gives a platform for building data flows for ETL (Extract, Transform and Load), and for processing and analyzing huge data sets.
Ten lines of Pig Latin correspond to approximately 200 lines of MapReduce Java code.
The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs, and that is an abstraction (which works like a black box).
Pig was initially developed by Yahoo.
2. Pig isn't suitable for all data processing tasks, however. Like MapReduce, it is designed for batch processing of data. If you want to perform a query that touches only a small amount of data in a large dataset, then Pig will not perform well, since it is set up to scan the whole dataset, or at least large portions of it.
3. Pig has two execution types or modes: local mode and Hadoop mode.
Local mode
In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small datasets
Hadoop mode
In Hadoop mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. Hadoop mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.
APACHE HIVE
Facebook created HIVE. It is a data warehousing component which performs reading, writing and managing large data sets in a distributed environment using an SQL-like interface. HIVE + SQL = HQL
The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
The Hive Command line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are used to establish a connection to the data store.
Hive is also highly scalable, as it can serve both purposes: large data set processing (i.e. batch query processing) and real-time processing (i.e. interactive query processing).
It supports all primitive data types of SQL.
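As a brief illustration of the JDBC driver mentioned above, the following hypothetical sketch (the host, port and the users table are assumptions, not details from the report) runs an HQL query from Java through HiveServer2's JDBC interface.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");        // load the HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // 'users' is a hypothetical table used only for this example
             ResultSet rs = stmt.executeQuery("SELECT name, COUNT(*) FROM users GROUP BY name")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}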
MAHOUT Mahout is renowned for machine learning. It provides an environment for creating machine learning applications which are scalable. Machine learning algorithms allow us to build self-learning machines that evolve by themselves without being explicitly programmed. Based on user behaviour, data patterns and past experiences, they make important future decisions. Machine learning is a descendant of Artificial Intelligence (AI). What does Mahout do?
Mahout performs collaborative filtering, clustering and classification. Some people also consider frequent itemset mining as one of Mahout's functions.
Apache Spark It is a framework for real-time data analytics in a distributed computing environment.
Written in Scala and was originally developed at the University of California, Berkeley.
It executes in-memory computations to increase speed of data processing over MapReduce.
It can be up to 100x faster than Hadoop MapReduce for large-scale data processing. Therefore, it requires higher processing power than MapReduce.
HBase
HBase is an open source, non-relational, distributed database. It is a NoSQL database. It is modelled after Google's BigTable, which is a distributed storage system designed to cope with large data sets.
It is designed to run on top of HDFS and provides BigTable-like capabilities. It gives us a fault-tolerant way of storing sparse data, which is common in most Big Data use cases. HBase is written in Java. A brief client sketch is given below.
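The following minimal sketch (the table user_profiles, the column family and the values are illustrative assumptions, not taken from the report) shows how a Java client can write and read a cell using the HBase client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml for the cluster location
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {
            Put put = new Put(Bytes.toBytes("row-1"));                      // row key
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),     // column family and qualifier
                          Bytes.toBytes("Pauri"));
            table.put(put);                                                 // write the cell

            Result result = table.get(new Get(Bytes.toBytes("row-1")));     // read the row back
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}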
AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable.
Ambari provides:
Hadoop cluster provisioning
Hadoop cluster management
Hadoop cluster monitoring
Apache Zookeeper
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various services in a Hadoop Ecosystem. Apache Zookeeper coordinates with various services in a distributed environment. Before Zookeeper, it was very difficult and time consuming to coordinate between different services in Hadoop Ecosystem. The services earlier had many problems with interactions like common configuration while synchronizing data. Even if the services are configured, changes in the configurations of the services make it complex and difficult to handle. The grouping and naming was also a time-consuming factor. Due to the above problems, Zookeeper was introduced. It saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.
Apache Oozie
Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Apache jobs, Oozie acts as a scheduler. It schedules Hadoop jobs and binds them together as one logical unit of work. There are two kinds of Oozie jobs:
Oozie workflow: These are sequential sets of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete his part.
Oozie Coordinator: These are the Oozie jobs which are triggered when the data is made available to it. Think of this as the response-stimuli system in our body. In the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and it rests otherwise.
Apache Flume Ingesting data is an important part of our Hadoop ecosystem.
Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
It gives us a solution which is reliable and distributed and helps us in collecting, aggregating and moving large amount of data sets.
It helps us to ingest online streaming data from various sources like network traffic, social media, email messages, log files, etc. into HDFS.
Advantages of Hadoop
1. Scalable Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables businesses to run applications on thousands of nodes involving many thousands of terabytes of data. 2. Cost effective Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale to such a degree in order to process such massive volumes of data. In an effort to reduce costs, many companies in the past would have had to down-sample data and classify it based on certain assumptions as to which data was the most valuable. The raw data would be deleted, as it would be too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was not available, as it was too expensive to store. 3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media and email conversations. Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection. 4. Fast
Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours. 5.Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.
Applications of Hadoop
Hadoop in Healthcare Sector
Hadoop for Telecom Industry
Hadoop in Retail Sector
Hadoop in the Financial Sector
1. How Is Hadoop Being Used?
Low-cost storage and data archive
The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later.
Sandbox for discovery and analysis
Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently, uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.
Data lake
Data lakes support storing data in its original or exact format. The goal is to offer a raw or unrefined view of data to data scientists and analysts for discovery and analytics. It helps them ask new or difficult questions without constraints. Data lakes are not a replacement for data warehouses. In fact, how to secure and govern data lakes is a huge topic for IT. They may rely on data federation techniques to create logical data structures.
Complement your data warehouse
We're now seeing Hadoop beginning to sit beside data warehouse environments, as well as certain data sets being offloaded from the data warehouse into Hadoop, or new types of data going directly to Hadoop. The end goal for every organization is to have the right platform for storing and processing data of different schemas, formats, etc., to support different use cases that can be integrated at different levels.
IoT and Hadoop
Things in the IoT need to know what to communicate and when to act. At the core of the IoT is a streaming, always-on torrent of data. Hadoop is often used as the data store for millions or billions of transactions. Massive storage and processing capabilities also allow you to use Hadoop as a sandbox for discovery and definition of patterns to be monitored for prescriptive instruction. You can then continuously improve these instructions, because Hadoop is constantly being updated with new data that doesn't match previously defined patterns.
2. When Not to Use Hadoop? Following are some of those scenarios:
Low-latency data access: quick access to small parts of data
Multiple data modification: Hadoop is a better fit only if we are primarily concerned about reading data and not writing data.
Lots of small files: Hadoop is a better fit in scenarios where we have few but large files.
3. Disadvantages of Hadoop As the backbone of so many implementations, Hadoop is almost synonymous with big data. 1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If whoever is managing the platform does not know how to enable it, your data could be at huge risk. Hadoop is also missing encryption at the storage and network levels, a feature that is a major selling point for government agencies and others that prefer to keep their data under wraps. 2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and as a result, implicated in numerous security breaches. 3. Not Fit for Small Data
While big data is not exclusively made for big businesses, not all big data platforms are suited to small data needs. Unfortunately, Hadoop happens to be one of them. Due to its high-capacity design, the Hadoop Distributed File System lacks the ability to efficiently support the random reading of small files. As a result, it is not recommended for organizations with small quantities of data. 4. Potential Stability Issues
Like all open source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to make sure they are running the latest stable version, or run it under a third-party vendor equipped to handle such problems. 5. General Limitations
The article introduces Apache Flume, MillWheel, and Google's own Cloud Dataflow as possible solutions. What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point the article stresses is that companies could be missing out on big benefits by using Hadoop alone.
Future perspective Because of all this activity in the industry, Imarticus Learning has created the CBDH (Certificate in Big Data and Hadoop) program, designed to ensure that you are job-ready to take up assignments in Big Data Analytics using the Hadoop framework. This functional skill-building program not only equips you with essential concepts of Hadoop but also gives you the required work experience in Big Data and Hadoop through the implementation of real-life industry projects. Since the data market forecast is strong and here to stay, knowledge of Hadoop and related technologies will act as a career boost in India, with its growing analytics market.
1. Other applications of Hadoop The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive data warehouse system. Hadoop can, in theory, be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used to complement a real-time system, such as lambda architecture, Apache Storm, Flink and Spark Streaming. As of October 2009, commercial applications of Hadoop included:
log and/or clickstream analysis of various kinds
marketing analytics
machine learning and/or sophisticated data mining
image processing
processing of XML messages
web crawling and/or text processing
general archiving, including of relational/tabular data, e.g. for compliance
Research Papers Some papers influenced the birth and growth of Hadoop and big data processing. Some of these are: 1. Jeffrey Dean, Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters, Google. This paper inspired Doug Cutting to develop an open-source implementation of the MapReduce framework. He named it Hadoop, after his son's toy elephant. 2. Michael Franklin, Alon Halevy, David Maier (2005) From Databases to Dataspaces: A New Abstraction for Information Management. The authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system's understanding of the data.
Hadoop Research Topics
Ability to make Hadoop scheduler resource aware, especially CPU, memory and IO resources. The current implementation is based on statically configured slots.
Ability to make a map-reduce job take new input splits even after the map-reduce job has already started.
Ability to dynamically increase replicas of data in HDFS based on access patterns. This is needed to handle hot-spots of data.
Ability to extend the map-reduce framework to be able to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger. A cluster of 3000 machines with 64 GB each can keep about 200 TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the TaskTracker machines. Performance should increase dramatically because it is costly to serialize/compress data from the disk into memory for every query.
Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures but rather anomalies that make parts of the cloud slow (but not fail completely); these impact the performance of jobs.
Make map-reduce jobs work across data centers. In many cases, a single hadoop cluster cannot fit into a single data center and a user has to partition the dataset into two hadoop clusters in two different data centers.
High Availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified/deleted by a buggy application.
Case Studies
Hadoop Usage at Last.fm Last.fm: The Social Music Revolution
Founded in 2002, Last.fm is an Internet radio and music community website that offers many services to its users, such as free music streams and downloads, music and event recommendations, personalized charts, and much more. There are about 25 million people who use Last.fm every month, generating huge amounts of data that need to be processed. One example of this is users transmitting information indicating which songs they are listening to (this is known as “scrobbling”). This data is processed and stored by Last.fm, so the user can access it directly (in the form of charts), and it is also used to make decisions about users’ musical tastes and compatibility, and artist and track similarity. Hadoop at Last.fm
As Last.fm's service developed and the number of users grew from thousands to millions, storing, processing and managing all the incoming data became increasingly challenging. Fortunately, Hadoop was quickly becoming stable enough and was enthusiastically adopted as it became clear how many problems it solved. It was first used at Last.fm in early 2006 and was put into production a few months later. There were several reasons for adopting Hadoop at Last.fm:
• The distributed filesystem provided redundant backups for the data stored on it (e.g., web logs, user listening data) at no extra cost.
• Scalability was simplified through the ability to add cheap, commodity hardware when required.
• The cost was right (free) at a time when Last.fm had limited financial resources.
• The open source code and active community meant that Last.fm could freely modify Hadoop to add custom features and patches.
Hadoop provided a flexible framework for running distributed computing algorithms with a relatively easy learning curve. Hadoop has now become a crucial part of Last.fm's infrastructure, currently consisting of two Hadoop clusters spanning over 50 machines, 300 cores, and 100 TB of disk space. Hundreds of daily jobs are run on the clusters performing operations such as logfile analysis, evaluation of A/B tests, ad hoc processing, and chart generation.