Analysis of Log file using Hadoop

A Project Report submitted in fulfilment of the requirement for the award of the degree of
BACHELOR OF TECHNOLOGY
by

L. Rama Narayana Reddy (13VD1A0532)
V. Tejasi (13VD1A0554)
P. Snigda (13VD1A0547)

Under the guidance of
Dr. K. Shahu Chatrapathi, Asst. Professor and HOD,
Dept. of Computer Science and Engineering, JNTUH College of Engineering Manthani

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
COLLEGE OF ENGINEERING MANTHANI
Pannur (Vil), Ramagiri (Mdl), Peddapally, Telangana (India)
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
COLLEGE OF ENGINEERING MANTHANI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DECLARATION BY THE CANDIDATES

We, L. Rama Narayana Reddy (13VD1A0532), V. Tejasi (13VD1A0554) and P. Snigda (13VD1A0547), hereby certify that the project report entitled "Analysis of Log file using Hadoop", carried out under the guidance of Dr. K. Shahu Chatrapathi, Assistant Professor in the Department of Computer Science and Engineering, JNTUH College of Engineering Manthani, is submitted in partial fulfillment for the award of the Degree of Bachelor of Technology in Computer Science and Engineering. This is a record of bonafide work carried out by us, and the results embodied in this project report have not been reproduced or copied from any source. The results embodied in this project have not been submitted to any other University or Institute for the award of any degree or diploma.

L. Rama Narayana Reddy (13VD1A0532)
V. Tejasi (13VD1A0554)
P. Snigda (13VD1A0547)
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
COLLEGE OF ENGINEERING MANTHANI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE FROM ACADEMIC ADVISOR

This is to certify that the project report entitled "Analysis of Log File using Hadoop", being submitted by L. Rama Narayana Reddy (13VD1A0532), V. Tejasi (13VD1A0554) and P. Snigda (13VD1A0547) in fulfillment for the award of the Degree of BACHELOR OF TECHNOLOGY in Computer Science and Engineering to the JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD, COLLEGE OF ENGINEERING MANTHANI, is a record of bonafide work carried out by them under my guidance and supervision. The results of the investigation enclosed in this report have been verified and found satisfactory. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.

Dr. K. Shahu Chatrapathi
Head of the Department
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
COLLEGE OF ENGINEERING MANTHANI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE FROM HEAD OF THE DEPARTMENT

This is to certify that the project report entitled "Analysis of Log File using Hadoop", being submitted by L. Rama Narayana Reddy (13VD1A0532), V. Tejasi (13VD1A0554) and P. Snigda (13VD1A0547) in fulfillment for the award of the Degree of Bachelor of Technology in Computer Science and Engineering to the JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD, COLLEGE OF ENGINEERING MANTHANI, is a record of bonafide work carried out by them under my guidance and supervision. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.

Dr. K. Shahu Chatrapathi
Head of the Department

Date:                                                External Examiner
ACKNOWLEDGMENT

We express our sincere gratitude to Prof. Sri Dr. Markandeya Chary, Principal, JNTUH College of Engineering Manthani, for encouraging us and giving permission to accomplish our project successfully. We express our sincere gratitude to Dr. Vishnu Vardhan, Vice Principal, JNTUH College of Engineering Manthani, for his excellent guidance, advice and encouragement in taking up this project. We express our profound gratitude and thanks to our project guide Dr. K. Shahu Chatrapathi, HOD, CSE Department, for his constant help, personal supervision, expert guidance and consistent encouragement throughout this project, which enabled us to complete our project successfully in time. We also take this opportunity to thank the other faculty members of the CSE Department for their kind co-operation. We wish to convey our thanks to one and all who have extended their helping hands directly and indirectly in the completion of our project.

L. Rama Narayana Reddy (13VD1A0532)
V. Tejasi (13VD1A0554)
P. Snigda (13VD1A0547)
National Informatics Centre

National Informatics Centre (NIC) was established in 1976, and has since emerged as a prime builder of e-Governance applications up to the grassroots level as well as a promoter of digital opportunities for sustainable development. NIC, through its ICT network "NICNET", has institutional linkages with all the Ministries/Departments of the Central Government, 36 State Governments/Union Territories, and about 688 District administrations of India. NIC has been instrumental in steering e-Government/e-Governance applications in government ministries and departments at the Centre, States, Districts and Blocks, facilitating improvement in government services, wider transparency, and promoting decentralized planning and management, resulting in better efficiency and accountability to the people of India. The "Informatics-led-development" programme of the government has been spearheaded by NIC to derive competitive advantage by implementing ICT applications in social and public administration. The following major activities are being undertaken:
● Setting up of ICT Infrastructure
● Implementation of National and State Level e-Governance Projects
● Products and Services
● Consultancy to the government departments
● Research and Development
● Capacity Building

During the last three decades, NIC has implemented many e-Governance applications in areas such as Land Records and Property registration, Culture & Tourism, Import & Export facilitation, Social Welfare Services, Micro-level Planning, etc. With increasing awareness leading to demand, and the availability of ICT infrastructure with better capacities and programme frameworks, the governance space in the country witnessed a new round of projects and products covering the entire spectrum of e-Governance, including G2C, G2B and G2G, with emphasis on service delivery. NIC provides nationwide common ICT infrastructure to support e-Governance services to the citizen, products and solutions designed to address e-Governance initiatives, major e-Governance projects, State/UT informatics support, and district-level services. NIC has set up state-of-the-art ICT infrastructure consisting of National and State Data Centres to manage the information systems and websites of Central Ministries/Departments, Disaster Recovery Centres, a Network Operations facility to manage heterogeneous networks spread across Bhawans, States and Districts, a Certifying Authority, and video-conferencing and capacity building across the country. The National Knowledge Network (NKN) has been set up to connect institutions and organizations carrying out research and development, higher education and governance, with speeds of the order of multiple gigabits per second. Further, State Government secretariats are connected to the Central Government by very high speed links on Optical Fibre Cable (OFC), and districts are connected to their respective State capitals through leased lines. Various initiatives like the Government eProcurement System (GePNIC), Office Management Software (eOffice), the Hospital Management System (eHospital), and the Government Financial Accounting Information System (eLekha) have been taken up, and are replicable in various Government organizations. As NIC supports a majority of the mission mode e-Governance projects, the chapter on National e-Governance Projects lists the details of these projects, namely the National Land Records Modernization Programme (NLRMP), Transport and National Registry, Treasury Computerization, VAT, MG-NREGA, India-Portal, e-Courts, Postal Life Insurance, etc. NIC also lays the framework and designs systems for online monitoring of almost all central government schemes, like the Integrated Watershed Management Programme (IWMP), IAY, SGSY, NSAP, BRGF, the Scheduled Tribes and Other Traditional Forest Dwellers Act, etc. ICT support is also provided in the States/UTs by NIC. Citizen-centric services are also rendered electronically at the district level, such as Income Certificates, Caste Certificates and Residence Certificates, along with other services like scholarship portals, permits, passes and licenses, to name a few. In executing all these activities, NIC has received recognition in terms of awards and accolades at both International and National levels, listed in the Awards Section. Thus NIC, a small programme started by the external stimulus of a UNDP project in the early 1970s, became fully functional in 1977 and has since grown with tremendous momentum to become one of India's major S&T organizations promoting informatics-led development.
ABSTRACT

In today's Internet world, logs are an essential part of any computing system, supporting capabilities from audits to error management. As logs grow and the number of log sources increases (such as in cloud environments), a scalable system is necessary to process logs efficiently. Log file analysis is becoming a necessary task for analyzing customer behavior in order to improve sales, and for datasets in domains like environment, science, social networks, medicine and banking it is equally important to analyze the log data to extract the required knowledge. Web mining is the process of discovering knowledge from web data. Log files are generated very fast, at the rate of 1-10 MB/s per machine, so a single data center can generate tens of terabytes of log data in a day. These datasets are huge. In order to analyze such large datasets we need a parallel processing system and a reliable data storage mechanism. A virtual database system is an effective solution for integrating data, but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage through the Hadoop Distributed File System (HDFS) and the MapReduce programming model, a parallel processing system for large datasets. HDFS breaks up input data and sends fractions of the original data to several machines in the Hadoop cluster to hold blocks of data. This mechanism helps to process log data in parallel using all the machines in the Hadoop cluster and computes the result efficiently. The dominant approach provided by Hadoop, "store first, query later", loads the data into HDFS and then executes queries written in Pig Latin. This approach reduces the response time as well as the load on the end system. Log files are a primary source of information for identifying system threats and problems that occur in the system at any point of time. These threats and problems can be identified by analyzing the log files and finding patterns of possibly suspicious behavior. The concerned administrator can then be provided with appropriate alerts or warnings regarding these security threats and problems, generated after the log files are analyzed; based upon these alerts or warnings the administrator can take appropriate actions. Many tools and approaches are available for this purpose; some are proprietary and some are open source.
CONTENTS

1. INTRODUCTION
   1.1 Features of HDFS
   1.2 Existing System
   1.3 Proposed System
   1.4 System Requirements
       1.4.1 Hardware Requirements
       1.4.2 Software Requirements
   1.5 Modules
   1.6 Process Diagrams
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS
   3.1 Existing System
   3.2 Proposed System
   3.3 Feasibility Study
       3.3.1 Economic Feasibility
       3.3.2 Technical Feasibility
       3.3.3 Social Feasibility
4. SYSTEM REQUIREMENT SPECIFICATIONS
   4.1 Introduction
   4.2 Non-Functional Requirements
   4.3 System Requirements
5. SYSTEM DESIGN
   5.1 Introduction
   5.2 High-level Design
   5.3 Low-level Design
       5.3.1 UML Diagrams
6. CODING
7. TESTING
   7.1 Types of Testing
   7.2 Test Strategy and Approach
   7.3 Test Cases
8. SCREENSHOTS
9. CONCLUSION
10. BIBLIOGRAPHY
1. INTRODUCTION:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. The Hadoop framework includes the following four modules:

• Hadoop Common: the Java libraries and utilities required by the other Hadoop modules. These libraries provide file system and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
• Hadoop YARN: a framework for job scheduling and cluster resource management.
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: a YARN-based system for parallel processing of large datasets.

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system. HDFS stores large files, typically in the range of gigabytes to petabytes, across multiple machines. HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata, while one or more slave DataNodes store the actual data.
1.1 Features of HDFS:

1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with HDFS.
3. The built-in web servers of the NameNode and DataNodes help users easily check the status of the cluster.
4. Streaming access to file system data.
5. HDFS provides file permissions and authentication.

HDFS Architecture:

(The HDFS architecture diagram from the original report is not reproduced here.)
1.2 Existing System:

The current processing of log files goes through ordinary sequential steps in order to perform preprocessing, session identification and user identification. The non-Hadoop approach loads the log file dataset and processes each line one after another. Each log field is identified by splitting the data and storing it in an array list. The preprocessed log field is stored in a hash table of key and value pairs, where the key is the month and the value is the integer representing the month. In the existing system the work can run only on a single computer with a single Java virtual machine (JVM). A JVM can handle a dataset bounded by the available RAM; i.e., if the RAM is 2 GB then a JVM can process a dataset of only about 1 GB, and processing log files greater than 1 GB becomes hectic. The non-Hadoop approach is performed on Java 1.6 with a single JVM. Although batch processing can be implemented in such single-processor programs, their limited capabilities make the processing problematic. Therefore, it is necessary to use a parallel processing approach to work effectively on massive amounts of large datasets.

Disadvantages: The problem with traditional management systems is that it is extremely cost prohibitive to scale to the degree needed to process such massive volumes of data. It is difficult to store and process such large datasets in today's technical world.
1.3 Proposed System:

The proposed solution is to analyze the web logs generated by the Apache Web Server, which is helpful for statistical analysis. The size of a web log can range anywhere from a few KB to hundreds of GB. The proposed mechanism is designed around different dimensions such as timestamp, browser and country; based on these dimensions, we can extract patterns out of the logs and obtain vital bits of information. The technologies used are the Apache Hadoop framework, Apache Flume, etc., with a Hadoop cluster (Gen1). Content is created by multiple web servers and logged on local hard disks. The proposed system uses a four-node environment where data is first stored manually on the local hard disk of the local machine. This log data is then transferred to HDFS using a Pig Latin script and processed by MapReduce to produce Comma Separated Values (CSV). We find the areas where errors or warnings occur on the server, and also find the spammer IPs in the web application. Then we use Excel or similar software to produce statistical information and generate reports.
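The flow above can be made concrete with a short Pig Latin sketch. This is only an illustration, not the report's actual script: it assumes the raw log has already been tokenized into space-separated fields (a real Apache access log needs a dedicated loader or a regular expression), and the path /weblogs/access.log, the field names and the 10,000-request spam threshold are assumptions.

    -- Load pre-tokenized web server log records from HDFS.
    -- (Assumption: fields are space-separated; field names are illustrative.)
    logs = LOAD '/weblogs/access.log' USING PigStorage(' ')
           AS (ip:chararray, ident:chararray, user:chararray, time:chararray,
               request:chararray, status:int, bytes:long);

    -- Areas where errors occur: keep server-side errors (HTTP 5xx).
    errors = FILTER logs BY status >= 500;
    STORE errors INTO '/weblogs/errors' USING PigStorage(',');

    -- Spammer IPs: count requests per client and keep unusually busy ones.
    by_ip    = GROUP logs BY ip;
    hits     = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS requests;
    suspects = FILTER hits BY requests > 10000;  -- threshold is an assumption

    -- Comma-separated output, ready for Excel or similar reporting tools.
    STORE suspects INTO '/weblogs/spammers' USING PigStorage(',');

The two STORE statements write CSV files back to HDFS, which is exactly the hand-off point to the spreadsheet reporting step described above.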
Table 1: Comparison between existing system and proposed system

Feature            | Existing System       | Proposed System
-------------------|-----------------------|---------------------------------------------
Storage Capacity   | Less                  | More
Processing Speed   | Slow                  | Fast
Reliability        | Less                  | More
Data Availability  | Less                  | High
Data Location      | Centralized           | Highly Distributed
Data Structure     | Pre-defined Structure | Structured, Semi-structured or Unstructured
1.4 System Requirements:

1.4.1 HARDWARE REQUIREMENTS:
• Processor Type : Intel (any version)
• Speed : 1.1 GHz
• RAM : 4 GB
• Hard disk : 20 GB

1.4.2 SOFTWARE REQUIREMENTS:
• Operating System : Ubuntu 14.04
• Coding Language : Java
• Scripting Language : Pig Latin Script
• IDE : Eclipse
• Web Server : Tomcat
• Database : HDFS
1.5 Modules:

Implementation is the stage of the project when the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, designing of methods to achieve changeover, and evaluation of changeover methods.

1.5.1 Number of Modules:

After careful analysis, the system has been identified to have the following modules:
• Creating the Pig Latin script
• Loading data into HDFS using the Pig Latin script
• Analyzing the dataset

1.6 Process Diagrams:

(The process diagrams from the original report are not reproduced here.)
2. LITERATURE SURVEY:

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. Big data involves huge volume, high velocity, and an extensible variety of data. This data can be of three types:
• Structured data: relational data.
• Semi-structured data: XML data.
• Unstructured data: Word, PDF, text, media logs.

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models, and it is developed under an open source license. It enables applications to work with thousands of nodes and petabytes of data. The Hadoop framework includes four modules: Hadoop Common, Hadoop YARN, the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The two major pieces of Hadoop are HDFS and MapReduce.
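To make "simple programming models" concrete, below is the classic word count written in Pig Latin, the scripting language used later in this report; when run in MapReduce mode, Pig compiles a script like this into one or more MapReduce jobs. The input path is illustrative.

    -- Classic word count; Pig turns this script into MapReduce jobs.
    lines  = LOAD '/data/sample.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;  -- prints (word, count) tuples to the console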
STEP 1: Installing Java.

Become the super user and give the following command:

# java -version

If Java is present then the output is as follows:

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

If the output is not as above, then install Java with the following command:

# sudo yum install java-1.8.0-openjdk

To verify whether Java is installed or not, run java -version again.

STEP 2: Creating a Hadoop user.

Create a user account named hadoop and add a password to it using these commands:

# adduser hadoop
# passwd hadoop

Generate a key-based ssh login to its own account:

# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
$ exit
STEP 3: Install Hadoop:

1. Download the Java 8 package and save the file in your home directory.
2. Extract the Java tar file.
   Command: tar -xvf jdk-8u101-linux-i586.tar.gz
3. Download the Hadoop 2.7.3 package.
   Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
4. Extract the Hadoop tar file.
   Command: tar -xvf hadoop-2.7.3.tar.gz
5. Configure Hadoop in pseudo-distributed mode.
   (a) Set up environment variables: open ~/.bashrc and append the following:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
Apply the changes in the current running environment:

source ~/.bashrc

STEP 4: Now set the Java path in hadoop-env.sh using the vi editor in the etc/hadoop folder:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-<version>/jre   (the exact path depends on the installed JDK)

(b) Edit the configuration files. Navigate to the location below:

cd $HADOOP_HOME/etc/hadoop

Now append these XML files:

vi core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>

vi mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

vi yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

(c) Format the NameNode. Go to bin and apply the command below:

hdfs namenode -format
STEP 5: Start the Hadoop cluster.

To start the Hadoop cluster, navigate to your Hadoop sbin directory and execute the scripts one by one.

cd $HADOOP_HOME/sbin/

• Run start-all.sh to start Hadoop:

start-all.sh

• To stop, use the following command:

stop-all.sh
STEP 6: Go to the Hadoop home directory and format the NameNode.

Command: cd
Command: cd hadoop-2.7.3
Command: bin/hadoop namenode -format

This formats the HDFS via the NameNode. This command is only executed the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Never format an up and running Hadoop filesystem: you will lose all the data stored in the HDFS.
STEP 7: Once the NameNode is formatted, go to the hadoop-2.7.3/sbin directory and start all the daemons.

Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh and mr-jobhistory-daemon.sh.

Or you can run all the services individually, as below:

Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in the HDFS and tracks all the files stored across the cluster.
Command: ./hadoop-daemon.sh start namenode

Start DataNode:
On startup, a DataNode connects to the NameNode and responds to requests from the NameNode for different operations.
Command: ./hadoop-daemon.sh start datanode

Start ResourceManager:
The ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its work is to manage each NodeManager and each application's ApplicationMaster.
Command: ./yarn-daemon.sh start resourcemanager

Start NodeManager:
The NodeManager on each machine is the framework agent responsible for managing containers, monitoring their resource usage and reporting the same to the ResourceManager.
Command: ./yarn-daemon.sh start nodemanager

Start JobHistoryServer:
The JobHistoryServer is responsible for servicing all job-history related requests from clients.
Command: ./mr-jobhistory-daemon.sh start historyserver

(Or)
Command: ./start-all.sh
This command is used to start all the services at a time. To stop all the services, use the command ./stop-all.sh.

STEP 8: To check that all the Hadoop services are up and running, run the command below.

Command: jps
STEP 9: Access the Hadoop services in a browser, e.g. by opening localhost:50070/dfshealth.html in Mozilla Firefox.

• The Hadoop NameNode starts on port 50070 by default: http://localhost:50070/
• The Hadoop DataNode starts on port 50075 by default: http://localhost:50075/
• The Hadoop Secondary NameNode starts on port 50090 by default: http://localhost:50090/
• Access port 8088 for information about the cluster and all applications: http://localhost:8088/
INSTALLATION OF APACHE HBASE ON UBUNTU 16.04:

Steps:

1. Download hbase-1.1.2 from the Apache site: http://www.eu.apache.org/dist/hbase/1.1.2/
2. Copy hbase-1.1.2-bin.tar.gz to your home directory, say /home/lakkireddy/edureka.
3. Untar the hbase-1.1.2-bin.tar.gz file:
   a. Open a command prompt.
   b. Type: sudo tar -xzf /home/lakkireddy/edureka/hbase-1.1.2-bin.tar.gz
4. Create the directory "hbase" in /usr/lib:
   a. Type: sudo mkdir /usr/lib/hbase
5. Move the untarred folder hbase-1.1.2 to /usr/lib/hbase:
   a. Type: sudo mv /home/lakkireddy/edureka/hbase-1.1.2 /usr/lib/hbase
6. Edit hbase-site.xml:
   a. On the command prompt, run the following commands:
   b. cd /usr/lib/hbase/hbase-1.1.2/conf
   c. sudo gedit hbase-site.xml
   d. Copy and paste the configuration below into hbase-site.xml:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>

   e. Save and exit gedit.
7. Edit hbase-env.sh:
   a. On the command prompt, run the following commands:
   b. cd /usr/lib/hbase/hbase-1.1.2/conf
   c. sudo gedit hbase-env.sh
   d. Export your Java home path, e.g.: export JAVA_HOME=/usr/lib/jvm/oracle_jdk8/jdk1.8.0_101
   e. Save and exit gedit.
   f. Exit the command prompt.
8. Export the HBASE_HOME path in the .bashrc file:
   a. Open a new terminal (command prompt).
   b. sudo vi .bashrc
   c. Add the following lines:
      export HBASE_HOME=/usr/lib/hbase/hbase-1.1.2
      export PATH=$PATH:$HBASE_HOME/bin
   d. Exit the vi editor.
9. Now start the Hadoop services; run the following commands:
   a. start-dfs.sh
   b. start-yarn.sh
   c. Verify that the Hadoop services are running; type the command: jps
10. Now start the HBase services; type the command:
    a. start-hbase.sh
    b. Verify that the HBase services are running; type the command: jps
    c. The following service names are displayed on the command prompt: HMaster, HRegionServer, HQuorumPeer
11. Verify that the hbase directory has been created on HDFS (the Hadoop Distributed File System); on the command prompt enter:
    a. hadoop fs -ls /tmp/hbase-hduser
12. On the command prompt, type:
    a. hbase shell
    b. After running the above command the hbase prompt is displayed as:
    c. hbase(main):001:0>
13. To verify that HBase is running, in a web browser:
    a. Open a web browser.
    b. Type the URL http://localhost:16010/master-status
Apache Pig Installation on Ubuntu 16.04:

Below are the steps for Apache Pig installation on Linux (ubuntu/centos/windows using a Linux VM). Ubuntu 16.04 is used in the setup below.

Step 1: Download the Pig tar file.
Command: wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz

Step 2: Extract the tar file using the tar command. In the tar command below, x means extract an archive file, z means filter the archive through gzip, and f gives the filename of the archive.
Command: tar -xzf pig-0.16.0.tar.gz
Command: ls

Step 3: Edit the ".bashrc" file to update the environment variables of Apache Pig. We set them so that we can access Pig from any directory and need not go to the Pig directory to execute Pig commands. Also, if any other application is looking for Pig, it will get to know the path of Apache Pig from this file.
Command: sudo gedit .bashrc

Add the following at the end of the file:

# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR

Also, make sure that the Hadoop path is set. Run the command below so the changes take effect in the same terminal.
Command: source .bashrc

Step 4: Check the Pig version, to test that Apache Pig was installed correctly. In case you don't get the Apache Pig version, verify that you have followed the above steps correctly.
Command: pig -version

Step 5: Check pig help to see all the Pig command options.
Command: pig -help

Step 6: Run Pig to start the grunt shell. The grunt shell is used to run Pig Latin scripts.
Command: pig
Apache Pig has two modes in which it can run; by default it chooses MapReduce mode. The other mode in which you can run Pig is local mode.

Execution modes in Apache Pig:

• Local Mode - With access to a single machine, all files are installed and run using the local host and file system. Local mode is specified using the '-x' flag (pig -x local). The input and output in this mode are present on the local file system.
Command: pig -x local

• MapReduce Mode - This is the default mode, which requires access to a Hadoop cluster and an HDFS installation. Since this is the default mode, it is not necessary to specify the -x flag. The input and output in this mode are present on HDFS.
Command: pig -x mapreduce
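For a quick sanity check of a Pig installation, a short grunt-shell session in local mode can be used. This is a sketch; the file path and alias names are illustrative.

    -- Started with: pig -x local
    -- In local mode, paths refer to the local file system, not HDFS.
    lines = LOAD '/tmp/sample.log' AS (line:chararray);
    first = LIMIT lines 5;    -- take the first five records
    DUMP first;               -- print them to the console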
3. SYSTEM ANALYSIS:
3.1 Existing System:

The current processing of log files goes through ordinary sequential steps in order to perform preprocessing, session identification and user identification. The non-Hadoop approach loads the log file dataset and processes each line one after another. Each log field is identified by splitting the data and storing it in an array list. The preprocessed log field is stored in a hash table of key and value pairs, where the key is the month and the value is the integer representing the month. In the existing system the work can run only on a single computer with a single Java virtual machine (JVM). A JVM can handle a dataset bounded by the available RAM; i.e., if the RAM is 2 GB then a JVM can process a dataset of only about 1 GB, and processing log files greater than 1 GB becomes hectic. The non-Hadoop approach is performed on Java 1.6 with a single JVM. Although batch processing can be implemented in such single-processor programs, their limited capabilities make the processing problematic. Therefore, it is necessary to use a parallel processing approach to work effectively on massive amounts of large datasets.

3.2 Proposed System:

The proposed solution is to analyze the web logs generated by the Apache Web Server, which is helpful for statistical analysis. The size of a web log can range anywhere from a few KB to hundreds of GB. The proposed mechanism is designed around different dimensions such as timestamp, browser and country; based on these dimensions, we can extract patterns out of the logs and obtain vital bits of information. The technologies used are the Apache Hadoop framework, Apache Flume, etc., with a Hadoop cluster (Gen1). Content is created by multiple web servers and logged on local hard disks. The proposed system uses a four-node environment where data is first stored manually on the local hard disk of the local machine. This log data is then transferred to HDFS using a Pig Latin script and processed by MapReduce to produce Comma Separated Values (CSV). We find the areas where errors or warnings occur on the server, and also find the spammer IPs in the web application. Then we use Excel or similar software to produce statistical information and generate reports.
3.3 Feasibility Study:

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.

3.3.1 Economic Feasibility:

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

3.3.2 Technical Feasibility:

This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client. The developed system must have modest requirements; only minimal or no changes are required for implementing this system.

3.3.3 Social Feasibility:

This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
4. SYSTEM REQUIREMENT SPECIFICATIONS:
4.1 INTRODUCTION:

The Software Requirements Specification plays an important role in creating quality software solutions. Specification is basically a representation process: requirements are represented in a manner that ultimately leads to successful software implementation. Requirements may be specified in a variety of ways; however, there are some guidelines worth following:
- The representation format and content should be relevant to the problem.
- Information contained within the specification should be nested.
- Diagrams and other notational forms should be restricted in number and consistent in use.
- Representations should be revisable.

4.2 NON-FUNCTIONAL REQUIREMENTS:

Usability: Usability is the ease of use and learnability of a human-made object. The object of use can be a software application, website, book, tool, machine, process, or anything a human interacts with. A usability study may be conducted as a primary job function by a usability analyst, or as a secondary job function by designers, technical writers, marketing personnel, and others.

Reliability: The probability that a component, part, equipment, or system will satisfactorily perform its intended function under given circumstances, such as environmental conditions, limitations as to operating time, and frequency and thoroughness of maintenance, for a specified period of time.

Performance: Accomplishment of a given task measured against preset standards of accuracy, completeness, cost, and speed.

Supportability: The degree to which the design characteristics of a standby or support system meet the operational requirements of an organization.

Implementation: Implementation is the realization of an application, or the execution of a plan, idea, model, design, specification, standard, algorithm, or policy.

Interface: An interface refers to a point of interaction between components, and is applicable at the level of both hardware and software. This allows a component, whether a piece of hardware such as a graphics card or a piece of software such as an internet browser, to function independently while using interfaces to communicate with other components via an input/output system and an associated protocol.

Legal: Established by or founded upon law, official or accepted rules, or relating to jurisprudence ("legal loophole"); having legal efficacy or force ("a sound title to the property"); relating to or characteristic of the profession of law ("the legal profession"); allowed by official rules ("a legal pass receiver").
4.3 SYSTEM REQUIREMENTS:

SOFTWARE REQUIREMENTS:
• Operating System : Ubuntu 14.04
• Coding Language : Java
• Scripting Language : Pig Latin Script
• IDE : Eclipse
• Web Server : Tomcat
• Database : HDFS

HARDWARE REQUIREMENTS:
• Processor Type : Intel (any version)
• Speed : 1.1 GHz
• RAM : 4 GB
• Hard disk : 20 GB
• Keyboard : 101/102 Standard Keys
5. SYSTEM DESIGN:
5.1 INTRODUCTION:

The most creative and challenging phase of the life cycle is system design. The term design describes a final system and the process by which it is developed. It refers to the technical specifications that will be applied in implementing the candidate system. Design may be defined as "the process of applying various techniques and principles for the purpose of defining a device, a process or a system in sufficient detail to permit its physical realization". The designer's goal is to determine how the output is to be produced and in what format; samples of the output and input are also presented. Second, the input data and database files have to be designed to meet the requirements of the proposed output. The processing phases are handled through program construction and testing. Finally, details related to the justification of the system and an estimate of the impact of the candidate system on the user and the organization are documented and evaluated by management as a step toward implementation.

The importance of software design can be stated with a single word: "quality". Design provides us with representations of software that can be assessed for quality. Design is the only way we can accurately translate a customer's requirements into a complete software product or system. Without design we risk building an unstable system that might fail if small changes are made, that may be difficult to test, or whose quality cannot be assessed. Design is therefore an essential phase in the development of a software product.

5.2 High-level design:

High-level design defines the complete architecture of the system being developed. In short, it is an overall representation of the design of the target system/application. It is usually done by higher-level professionals/software architects.

5.3 Low-level design:

5.3.1 UML DIAGRAMS:

The UML is a language for:
• Visualizing
• Specifying
• Constructing
• Documenting
these artifacts of a software-intensive system.

A conceptual model of UML:

The three major elements of UML are:
1. The UML's basic building blocks.
2. The rules that dictate how those building blocks may be put together.
3. Some common mechanisms that apply throughout the UML.

Basic building blocks of the UML:

The vocabulary of UML encompasses three kinds of building blocks:
1. Things
2. Relationships
3. Diagrams
Things are the abstractions that are first-class citizens in a model; relationships tie these things together; diagrams group interesting collections of things.

Things in UML: There are four kinds of things in the UML:
1. Structural things
2. Behavioral things
3. Grouping things
4. Annotational things
These things are the basic object-oriented building blocks of the UML. They are used to write well-formed models.

STRUCTURAL THINGS: Structural things are the nouns of UML models. These are mostly the static parts of a model, representing elements that are either conceptual or physical. In all, there are seven kinds of structural things.

5.3.2 USE CASE DIAGRAM:

User: Users have accessibility to the overall system, specifically for data insertion, deletion, updation and queries. They are the highest authorities within the system, having maximum control over the entire database.

5.3.3 CLASS DIAGRAM:

5.3.4 SEQUENCE DIAGRAM:

(The UML diagrams from the original report are not reproduced here.)
Analysis of a Sample Log file using a Pig Latin Script:

The log file consists of parameters of the form Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created, Last_Login, Latitude, Longitude. To analyze the log file, a Pig Latin script is used as below. Initially the log file is in the local file system; to load it into the Hadoop Distributed File System (HDFS) the following commands have to be executed.

STEP 1: First we have to create a folder in the HDFS:
Command: hdfs dfs -mkdir /pigdata

Now to load the log file into /pigdata, the following command has to be executed:
Command: hdfs dfs -put /home/lakkireddy/sales.csv /pigdata

With this command the log files are loaded into HDFS, and we can run our Pig script to analyze them in mapreduce mode rather than local mode. After loading the log file into HDFS we have to write the Pig script to analyze that particular log file; the shape of the script differs depending on whether the log file is used for knowledge discovery, analysis of system threats, or analysis of user call log data. In the Pig Latin script we extract the log file data based on our requirement, using the appropriate Pig queries. With the following command we execute the script in mapreduce mode:
Command: pig -x mapreduce sales.pig
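The report shows the script itself only as a screenshot, which is not reproduced here. The following is a plausible sketch of sales.pig, assuming the field list above and a count grouped by country; the grouping dimension and alias names are assumptions, while the output folder and tab separator match the STORE query quoted below. (The result alias is named counts here because output clashes with a reserved word in Pig Latin.)

    -- sales.pig: a sketch, not the report's exact script.
    -- Load the sales log from HDFS, splitting on commas.
    raw = LOAD '/pigdata/sales.csv' USING PigStorage(',')
          AS (transaction_date:chararray, product:chararray, price:double,
              payment_type:chararray, name:chararray, city:chararray,
              state:chararray, country:chararray, account_created:chararray,
              last_login:chararray, latitude:double, longitude:double);

    -- Group by one analysis dimension (country is assumed here).
    by_country = GROUP raw BY country;

    -- Count transactions per country.
    counts = FOREACH by_country GENERATE group AS country, COUNT(raw) AS txns;

    -- Write the result back to HDFS, tab-separated.
    STORE counts INTO 'pigoutput4' USING PigStorage('\t');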
The MapReduce jobs then run; the console shows the job progress (for example, 80% complete) and, when the job finishes, the output is displayed at the command prompt. In order to write the output into HDFS, the following query is used in the Pig Latin script:

Query: STORE output INTO 'pigoutput4' USING PigStorage('\t');

In HDFS the output is stored under the folder pigoutput4; opening that folder leads to the output file location. Then, by clicking the part-m-00000 file, a download option becomes available to download the log analysis result; after clicking the download option the output file is downloaded.
6. CODING:
Program to load data (UploadFile.java):

package net.codejava.upload;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;

// Servlet implementation class for uploading a log file into the local
// DataNode directory so that it can then be loaded into HDFS.
public class UploadFile extends HttpServlet {
    private static final long serialVersionUID = 1L; // original value unreadable

    // location to store the uploaded file
    private static final String UPLOAD_DIRECTORY = "upload";

    public UploadFile() {
        super();
    }

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // GET requests simply show the upload form
        request.getRequestDispatcher("/WEB-INF/index.jsp").forward(request, response);
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // checks whether the request actually contains an upload; if not, we stop here
        if (!ServletFileUpload.isMultipartContent(request)) {
            PrintWriter writer = response.getWriter();
            writer.println("Error: form must have enctype=multipart/form-data.");
            writer.flush();
            return;
        }
        // configures upload settings
        DiskFileItemFactory factory = new DiskFileItemFactory();
        // sets the temporary location to store files
        factory.setRepository(new File(System.getProperty("java.io.tmpdir")));
        ServletFileUpload upload = new ServletFileUpload(factory);
        // constructs the directory path to store the upload file;
        // this points at the local DataNode directory
        String uploadPath = "C:/hadoop-2.3.0/hadoop-dir/datanode-dir"
                + File.separator + UPLOAD_DIRECTORY;
        // creates the directory if it does not exist
        File uploadDir = new File(uploadPath);
        if (!uploadDir.exists()) {
            uploadDir.mkdir();
        }
        try {
            // parses the request's content to extract file data
            List<FileItem> formItems = upload.parseRequest(request);
            if (formItems != null && formItems.size() > 0) {
                // iterates over the form's fields
                for (FileItem item : formItems) {
                    // processes only fields that are not simple form fields
                    if (!item.isFormField()) {
                        String fileName = new File(item.getName()).getName();
                        String filePath = uploadPath + File.separator + fileName;
                        File storeFile = new File(filePath);
                        // saves the file on disk
                        item.write(storeFile);
                        request.setAttribute("message", "Upload has been done successfully!");
                    }
                }
            }
        } catch (Exception ex) {
            // the tail of the original listing was unreadable; report the failure
            request.setAttribute("message", "There was an error: " + ex.getMessage());
        }
        request.getRequestDispatcher("/WEB-INF/index.jsp").forward(request, response);
    }
}
HDFS file operations (Operations.java):

package HdfsFileOperation;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Operations {
    public static void main(String[] args) throws IOException {
        FileSystem hdfs = FileSystem.get(new Configuration());

        // Print the home directory
        System.out.println("Home folder - " + hdfs.getHomeDirectory());

        // Create & delete directories
        Path workingDir = hdfs.getWorkingDirectory();
        Path newFolderPath = new Path("/MyDataFolder");
        newFolderPath = Path.mergePaths(workingDir, newFolderPath);
        if (hdfs.exists(newFolderPath)) {
            // Delete the existing directory
            hdfs.delete(newFolderPath, true);
            System.out.println("Existing Folder Deleted.");
        }
        hdfs.mkdirs(newFolderPath); // Create new directory
        System.out.println("Folder Created.");

        // Copying a file from local to HDFS
        Path localFilePath = new Path("c://localdata/datafile1.txt");
        Path hdfsFilePath = new Path(newFolderPath + "/dataFile1.txt");
        hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);
        System.out.println("File copied from local to HDFS.");

        // Copying a file from HDFS to local
        localFilePath = new Path("c://hdfsdata/datafile1.txt");
        hdfs.copyToLocalFile(hdfsFilePath, localFilePath);
        System.out.println("Files copied from HDFS to local.");

        // Creating a file in HDFS
        Path newFilePath = new Path(newFolderPath + "/newFile.txt");
        hdfs.createNewFile(newFilePath);

        // Writing data to the HDFS file
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 5; i++) { // loop bound unreadable in the original; 5 assumed
            sb.append("Data");
            sb.append(i);
            sb.append("\n");
        }
        byte[] byt = sb.toString().getBytes();
        FSDataOutputStream fsOutStream = hdfs.create(newFilePath);
        fsOutStream.write(byt);
        fsOutStream.close();
        System.out.println("Written data to HDFS file.");

        // Reading data from the HDFS file
        System.out.println("Reading from HDFS file.");
        BufferedReader bfr = new BufferedReader(
                new InputStreamReader(hdfs.open(newFilePath)));
        String str = null;
        while ((str = bfr.readLine()) != null) {
            System.out.println(str);
        }
    }
}
Main.java:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Main {
    /*
     * This program processes Apache HTTP Server log files using MapReduce.
     * Several string constants below were unreadable in the original listing
     * and have been reconstructed; they are marked in comments.
     */
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        if (args.length < 2) { // argument check reconstructed
            System.err.println("Usage: Main <input dir> <output dir>");
            System.exit(1);
        }
        Configuration conf = new Configuration();
        // Regular expression that splits one log line into fields. The original
        // was unreadable; this is the standard Apache combined-log pattern.
        conf.set("logEntryPattern",
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] " +
            "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]*)\" \"([^\"]*)\"");
        // Indexes of the capture groups to count (e.g. 1 = IP, 6 = HTTP status).
        conf.set("fieldsToCount", "1 6"); // value unreadable in the original; illustrative

        Job countJob = Job.getInstance(conf, "log count"); // job name reconstructed
        countJob.setJarByClass(Main.class);
        countJob.setMapOutputKeyClass(Text.class);
        countJob.setMapOutputValueClass(IntWritable.class);
        countJob.setOutputKeyClass(Text.class);
        countJob.setOutputValueClass(IntWritable.class);
        countJob.setMapperClass(CountMapper.class);
        countJob.setReducerClass(CountReducer.class);
        countJob.setInputFormatClass(TextInputFormat.class);
        countJob.setOutputFormatClass(TextOutputFormat.class);
        // this performs a reduce on the Map outputs before they are sent to the Reducer
        countJob.setCombinerClass(CountReducer.class);

        // file name reconstructed; the original constant was unreadable
        Path inputFile = new Path(args[0] + File.separator + "access.log");
        Path countOutput = new Path(args[1]);

        // Perform some checking on the input and output files
        FileSystem fileSystem = FileSystem.get(conf);
        if (!fileSystem.exists(inputFile)) {
            System.err.println("Input file does not exist! - " + inputFile.getParent());
            return;
        }
        if (fileSystem.exists(countOutput)) {
            fileSystem.delete(countOutput, true);
            System.out.println("Deleted existing output directory."); // message reconstructed
        }
        // note: do not close the shared FileSystem here; the job submitter still needs it

        FileInputFormat.addInputPath(countJob, inputFile);
        FileOutputFormat.setOutputPath(countJob, countOutput);
        countJob.waitForCompletion(true);
        System.out.println("Job finished."); // message reconstructed
    }
}
Mapper.java:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    /**
     * @param key
     * @param value   a line from a log file
     * @param context
     */
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // conf key names reconstructed; they must match those set in Main
        Pattern logEntryPattern = Pattern.compile(conf.get("logEntryPattern"));
        String[] fieldsToCount = conf.get("fieldsToCount").split(" ");
        // declaration reconstructed: split the input value into individual lines
        String[] entries = value.toString().split("\r?\n");
        /*
         * For each entry in the log file, generate a k/v pair for every field
         * we're interested in counting. These are encoded in a string of
         * integers in the job conf variable fieldsToCount. The reducer will
         * simply add up occurrences of each field key such as an IP address,
         * HTTP response, User Agent etc. This mapper is very generic and the
         * field mapping relies on the regular expression used to split each
         * line into a set number of fields.
         */
        for (int i = 0; i < entries.length; i++) {
            Matcher logEntryMatcher = logEntryPattern.matcher(entries[i]);
            if (logEntryMatcher.find()) {
                for (String index : fieldsToCount) {
                    if (!index.equals("")) {
                        Text k = new Text(index + " "
                                + logEntryMatcher.group(Integer.parseInt(index)));
                        context.write(k, one);
                    }
                }
            }
        }
    }
}
Reducer.java:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable total = new IntWritable(0);

    /* @see org.apache.hadoop.mapreduce.Reducer#reduce(KEYIN,
     * java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer.Context) */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum up the occurrence counts emitted by the mapper for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}
7. TESTING:
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or the finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test; each test type addresses a specific testing requirement.

7.1 TYPES OF TESTS:

7.1.1 Unit testing: Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.

7.1.2 Integration testing: Integration tests are designed to test integrated software components to determine whether they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.

7.1.3 Functional testing: Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items:
• Valid Input: identified classes of valid input must be accepted.
• Invalid Input: identified classes of invalid input must be rejected.
• Functions: identified functions must be exercised.
• Output: identified classes of application outputs must be exercised.
• Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of the current tests is determined.

7.1.4 System Test: System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points.

7.1.5 White Box Testing: White box testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.

7.1.6 Black Box Testing: Black box testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without considering how the software works.

7.1.7 Unit Testing: Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases.

7.2 TEST STRATEGY AND APPROACH:

Field testing will be performed manually, and functional tests will be written in detail.

7.2.1 Test objectives:
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.

7.2.2 Features to be tested:
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
Integration Testing: Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects. The task of the integration test is to check that components or software applications, e.g. components in a software system or, one step up, software applications at the company level, interact without error.

S.No | Test case name | Test case description | Expected output   | Actual output     | Result
1    | user           | IP                    | 192.168.8.8       | 192.168.8.8       | success
2    | user           | URL                   | http://google.com | http://google.com | success
3    | user           |                       |                   |                   |
4    | user           |                       |                   |                   |
5    | user           |                       |                   |                   |

(The details of test cases 3-5 were not legible in the source document.)
Test Results: All the test cases mentioned above passed successfully. No defects were encountered.

Acceptance Testing: User acceptance testing is a critical phase of any project and requires significant participation by the end user. It also ensures that the system meets the functional requirements.

Test Results: All the test cases mentioned above passed successfully. No defects were encountered.
8. SCREENSHOTS:

(The screenshots from the original report are not reproduced in this text version.)
9. CONCLUSION:
Log analysis helps to improve business strategies as well as to generate statistical reports. A Hadoop MapReduce based log file analysis tool provides graphical reports showing hits on web pages, users' page view activity, which parts of the website users are interested in, traffic attacks, etc. From these reports business communities can evaluate which parts of the website need to be improved, who the potential customers are, and from which IP, area or region the website is getting maximum hits, all of which helps in designing future business and marketing plans. The Hadoop MapReduce framework provides parallel distributed computing and reliable data storage by replicating data for large volumes of log files. Firstly, data is stored block-wise in racks on several nodes of a cluster, so that the required access time is reduced, which saves much of the processing time and enhances performance; here Hadoop's characteristic of moving computation to the data, rather than moving data to the computation, helps to improve response time. Secondly, MapReduce works successfully in a distributed fashion on large datasets, giving more efficient results. Web server log processing has a bright, vibrant scope in the field of information technology. IT organizations analyze server logs to answer questions about security and compliance. The proposed system focuses on a network security use case; specifically, we looked at how Apache Hadoop can help the administrator of a large enterprise network diagnose and respond to a distributed denial-of-service attack.
10. BIBLIOGRAPHY:
http://tipsonubuntu.com/2016/07/31/install-oracle-java-8-9-ubuntu-16-04-linux-mint-18/
http://www.tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/
http://www.wikihow.com/Set-Up-Your-Java_Home-Path-in-Ubuntu
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.tutorialspoint.com/apache_pig/apache_pig_installation.htm
https://pig.apache.org/docs/r0.7.0/setup.html
http://stackoverflow.com/questions/14614/log-files-in-hbase
https://community.hortonworks.com/content/supportkb/4916/where-can-i-find-region-server-log.html
http://data-flair.training/blogs/install-run-apache-pig-ubuntu-quickstart-guide/
http://blogs.perficient.com/delivery/blog/2015/09/09/some-ways-load-data-from-hdfs-to-hbase/
http://www.trytechstuff.com/how-to-install-pig-on-ubuntulinux/
https://www.youtube.com/results?search_query=how+to+load+unstructured+data+into+hadoop
https://sreejithrpillai.wordpress.com/2015/01/08/bulkloading-data-into-hbase-table-using-mapreduce/
http://www.cloudera.com/documentation/cdh/5-0-x/CDH5-Installation-Guide/cdh5ig_pig_install.html
http://www.tecadmin.net/steps-to-install-tomcat-server-on-centos-rhel/
http://hadooptutorial.info/pig-installation-on-ubuntu/