RECOMMENDER SYSTEM USING APACHE HADOOP

Submitted by
ANKAN BANERJEE (13000112064)
ANKIT (13000112065)
ANKIT GUPTA (13000112067)
HEMANT KUMAR JOSHI (13000112101)

Under the guidance of
Mr. TAPAN CHOWDHURY

Submitted in partial fulfilment of the requirements for the degree of Bachelor of Technology in Computer Science and Engineering

Techno India
EM 4/1, Salt Lake, Sector V, Kolkata – 700 091
CERTIFICATE

This is to certify that the project report entitled “Recommender System Using Apache Hadoop”, prepared under my supervision by Ankan Banerjee (13000112064), Ankit (13000112065), Ankit Gupta (13000112067) and Hemant Kumar Joshi (13000112101) of B.Tech. (Computer Science & Engg.), Final Year, has been done according to the regulations of the Degree of Bachelor of Technology in Computer Science & Engineering. The candidates have fulfilled the requirements for the submission of the project report. It is to be understood that the undersigned does not necessarily endorse any statement made, opinion expressed or conclusion drawn therein, but approves the report only for the purpose for which it has been submitted.
----------------------------------------------Mr. Tapan Chowdhury Asst. Professor Computer Science and Engineering Techno India Salt Lake
----------------------------------------------Prof. (Dr.) C. K. Bhattacharyya Head of Department Computer Science and Engineering Techno India Salt Lake
----------------------------------------------(Signature of the External Examiner with Designation and Institute)
-------------------------------------------------------------------------------------------DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING --------------------------------------------------------------------------------------------
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Mr. Tapan Chowdhury of the Department of Computer Science and Engineering, whose role as project guide was invaluable to the project. We are extremely thankful for the keen interest he took in advising us, for the books and reference materials he provided, and for the moral support extended to us. Last but not least, we convey our gratitude to all the teachers for providing us the technical skills that will always remain our asset, and to all non-teaching staff for the gracious hospitality they offered us.
Place: Techno India, Salt Lake
Date: 12th May, 2016

……………………………… Ankan Banerjee
……………………………… Ankit
……………………………… Ankit Gupta
……………………………… Hemant Kumar Joshi
Contents

1. INTRODUCTION ................................................................ 6
   1.1 Abstract ................................................................ 6
   1.2 Problem Domain .......................................................... 6
      1.2.1 Software and Language Versions .................................... 6
      1.2.2 Hardware Specification of each Hadoop Node ........................ 6
      1.2.3 Business Domain ................................................... 7
   1.3 Related Study ........................................................... 7
   1.4 Glossary ................................................................ 7
2. PROBLEM DEFINITION .......................................................... 9
   2.1 Scope ................................................................... 9
   2.2 Exclusions .............................................................. 9
   2.3 Assumptions ............................................................. 9
3. PROJECT PLANNING ........................................................... 10
   3.1 Software Life Cycle Model .............................................. 10
   3.2 Scheduling ............................................................. 11
   3.3 Cost Analysis .......................................................... 13
4. REQUIREMENT ANALYSIS ....................................................... 14
   4.1 Requirement Matrix ..................................................... 14
   4.2 Requirement Elaboration ................................................ 15
5. DESIGN ..................................................................... 17
   5.1 Technical Environment .................................................. 17
   5.2 Hierarchy of Modules ................................................... 17
   5.3 Detailed Design ........................................................ 17
   5.4 Test Plan .............................................................. 26
6. Implementation ............................................................. 28
   6.1 Implementation Details ................................................. 28
      6.1.1 Cluster Configuration ............................................. 28
      6.1.2 Data Storage ...................................................... 28
      6.1.3 Analysis of Data and Recommendation ............................... 29
   6.2 System Installation Steps .............................................. 31
   6.3 System Usage Instructions .............................................. 33
7. Conclusion ................................................................. 34
   7.1 Project Benefits ....................................................... 34
   7.2 Future Scope for Improvements ......................................... 34
8. References ................................................................. 35
APPENDIX ...................................................................... 37
   A.1 Core-site.xml .......................................................... 37
   A.2 localhost:54310 ........................................................ 37
   A.3 Hdfs-site.xml .......................................................... 38
   A.4 U.data ................................................................. 38
   A.5 MapReduce 1 ............................................................ 39
   A.6 MapReduce 2 ............................................................ 39
   A.7 MapReduce 3 ............................................................ 40
List of Tables:

Table 1.1: Hardware Specification ............ 6
Table 6.1 .................................... 26

List of Figures:

Figure 3.1: Iterative Waterfall Model ........ 10
Figure 3.2: Gantt Chart ...................... 12
Figure 3.3: Cost Analysis .................... 13
Figure 4.1: Requirement Matrix ............... 14
Figure 5.1: Hierarchy of Modules ............. 17
Figure 5.2: Use Case Diagram ................. 17
Figure 5.3: Class Diagram .................... 18
Figure 5.4: Collaborative Filtering .......... 21
1. INTRODUCTION

1.1 Abstract
Recommender systems are new-generation internet tools that help users navigate information on the internet and receive information related to their preferences. [1] Although recommender systems are most often applied to online shopping and entertainment domains such as movies and music, their applicability is being researched in other areas as well. This report presents an overview of recommender systems currently working in the domain of online movie recommendation. [2] It also proposes a new movie recommender system that combines a user's choices not only with those of similar users but with other users as well, to give diverse recommendations that change over time. The overall architecture of the proposed system is presented. [3]
1.2 Problem Domain

1.2.1 Software and Language Versions
Hadoop 1.2.1
Java 1.6

1.2.2 Hardware Specification of each Hadoop Node
Hadoop clusters have identical hardware specifications for all cluster nodes. Table 1.1 lists the specification of each node.

Operating System: Ubuntu 12.04 LTS (64-bit)
Processor: Intel Core i3 (Quad Core)
Memory: 3 GB
Disk Space: 160 GB

Table 1.1: Hardware Specification
1.2.3 Business Domain
Traditional recommender systems suggest items belonging to a single domain: movies in Netflix, songs in Last.fm, etc. [4] This is not perceived as a limitation, but as a focus on a certain market. Recommender systems may be used primarily as systems that suggest appropriate actions to satisfy a user's needs.
1.3. Related Study
Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. [2] This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties. These approaches are often combined. [3][5]

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. [11] All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework. The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). [15] Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code to the nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality (nodes manipulating the data that they have on hand) to allow the data to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking. [13][15]
1.4 Glossary

Apache Hadoop: Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept.

DataNode: A DataNode stores data in HDFS. A functional filesystem has more than one DataNode, with data replicated across them.

JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.

TaskTracker: A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Replication Factor: The replication factor is the number of copies of each data block stored on different nodes; in this way, high fault tolerance and high availability can be achieved.

HDFS: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. This Apache Software Foundation project is designed to provide a fault-tolerant file system that runs on commodity hardware.
2. PROBLEM DEFINITION

2.1 Scope
Recommender systems are widespread tools employed by a wide range of organizations and companies for recommending items such as movies, books and even employees for projects. But with the advent of big data, it has become difficult to process the large amount of data for recommendations. For this reason, Apache Hadoop is employed for scalability, reliability and faster processing. [1][5] Recommender systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item. Recommender systems have become extremely common in recent years, and are applied in a variety of applications.

2.2 Exclusions

Big data collection interface and modification of user data (ratings).

2.3 Assumptions

Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a large number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. [5]

Large Data Set
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
3. PROJECT PLANNING

3.1 Software Life Cycle Model

Iterative Waterfall Model
The problems with the Waterfall Model created a demand for a new method of developing systems that could provide faster results, require less up-front information and offer greater flexibility. In the iterative model, the project is divided into small parts. This allows the development team to demonstrate results earlier in the process and obtain valuable feedback from system users. Each iteration is effectively a mini-Waterfall process, with the feedback from one phase providing vital information for the design of the next phase.
Figure 3.1: Iterative Waterfall Model
3.2 Scheduling
Figure 3.2: Gantt Chart
3.3 Cost Analysis
Here we have assumed that the standard rate of each resource person is $2/hour. Under this assumption, the total cost of the project comes to $14,820.00, which corresponds to 7,410 person-hours of effort. The budget and cost analysis is given below.
Figure 3.3: Cost Analysis
4. REQUIREMENT ANALYSIS

4.1 Requirement Matrix
Figure 4.1: Requirement Matrix
4.2 Requirement Elaboration

4.2.1 Cluster Configuration
The project has to be implemented on huge sets of data, namely Big Data. Hence, an HDFS cluster of five computers has been configured as per the project requirements. One computer has been configured as the NameNode and the remaining four as DataNodes.

4.2.1.1 Rack Awareness Implementation
As per the project requirements, three racks of computers have been set up: two racks with two computers each, and one with a single computer, the NameNode. The replication factor has been set to three so that for each block of data there are three copies, one on each rack.

4.2.2 Data Storage
Since the project is based on huge sets of data, an efficient system for the storage, retrieval and analysis of this data was required. Hadoop has its own file system, the Hadoop Distributed File System, which has been used for this purpose. The data for the project is stored in HDFS.

4.2.2.1 Data Storage in HDFS Cluster for Analysis
The data obtained from the datasets is stored in the Hadoop Distributed File System in blocks of size 64 MB. Each block of data has three copies, each stored on one of the racks.

4.2.3 Analysis of Data and Recommendation

4.2.3.1 InputFormat to select the data for input to MapReduce and define the InputSplits that break a file into tasks
The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
4.2.3.2 MapReduce Program to customize Data Set
The first task is, for each user, to emit a row containing their 'postings' as (item, rating) pairs. The reducer also emits each user's rating sum and count for use in later steps.
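The logic of this first job can be sketched in plain Java, with collections standing in for Hadoop's shuffle; the method names and the one-line-per-record input format are illustrative assumptions, not the project's actual MapReduce code (which appears in the appendix).

```java
import java.util.*;

// Sketch of the first MapReduce job, with plain collections standing in
// for Hadoop's shuffle. Input lines follow the u.data layout
// "user item rating"; for each user we emit the rating sum, the rating
// count, and the (item, rating) postings. Names are illustrative only.
public class UserPostings {

    // Map phase: key each "user item rating" record by user.
    static Map<String, List<String[]>> mapPhase(List<String> lines) {
        Map<String, List<String[]>> grouped = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\s+"); // user, item, rating
            grouped.computeIfAbsent(fields[0], u -> new ArrayList<>()).add(fields);
        }
        return grouped;
    }

    // Reduce phase: emit "sum count item1:rating1 item2:rating2 ..." per user.
    static Map<String, String> reducePhase(Map<String, List<String[]>> grouped) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, List<String[]>> e : grouped.entrySet()) {
            double sum = 0;
            StringBuilder postings = new StringBuilder();
            for (String[] fields : e.getValue()) {
                sum += Double.parseDouble(fields[2]);
                postings.append(' ').append(fields[1]).append(':').append(fields[2]);
            }
            out.put(e.getKey(), sum + " " + e.getValue().size() + postings);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("196 242 3", "196 302 4", "186 377 1");
        System.out.println(reducePhase(mapPhase(lines)));
    }
}
```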
4.2.3.3 MapReduce Program to perform Correlation
For each row we calculate similarity by computing the number of people n who rated both movie X and movie Y, the sum over all elements in each ratings vector (sum_x, sum_y), the sum of the products of corresponding ratings (sum_xy), and the squared sum of each vector (sum_xx, sum_yy). From these quantities we can calculate the correlation between the movies. The (Pearson) correlation can be expressed as:

corr(X, Y) = (n * sum_xy - sum_x * sum_y) / sqrt((n * sum_xx - sum_x^2) * (n * sum_yy - sum_y^2))
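In terms of those accumulated sums, the correlation step might look like this in plain Java; this is a sketch of the computation, not the project's MapReduce code.

```java
// Sketch of the correlation computed in the second MapReduce job:
// Pearson correlation of two movies' rating vectors over their common
// raters, expressed via the accumulated sums described in the text.
public class Correlation {

    // x[i], y[i]: ratings given by the i-th common user to each movie.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
            sumXY += x[i] * y[i];
        }
        double num = n * sumXY - sumX * sumY;
        double den = Math.sqrt((n * sumXX - sumX * sumX)
                             * (n * sumYY - sumY * sumY));
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // Perfectly linearly related ratings give correlation 1.0.
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{2, 4, 6}));
    }
}
```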
4.2.3.4 MapReduce Program to sort and Format Recommendations
The last step of the job sorts the top-correlated items for each item and prints them to the output.

4.2.3.5 Output Collector
Produce personalized recommendation for individual users according to user requirement using Apache HIVE.
5. DESIGN

5.1 Technical Environment
The Recommender System shall be deployed over a two-node cluster, with the Java runtime set up to use the common Hadoop configuration as specified by the NameNode (master node) in the cluster.
5.2 Hierarchy of modules
Figure 5.1: Hierarchy of modules
5.3 Detailed Design
Figure 5.2: Use Case Diagram
Figure 5.3: Class Diagram
5.3.1 Hadoop Cluster Configuration
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. [13]
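Each node in such a cluster is pointed at the common configuration via core-site.xml (the project's actual file is reproduced in the appendix). The sketch below is illustrative only: the host name is a placeholder, while the port matches the localhost:54310 address the appendix refers to.

```xml
<?xml version="1.0"?>
<!-- core-site.xml sketch: tells every node where the NameNode is.
     hdfs://master:54310 is a placeholder; substitute the actual
     NameNode host (the appendix entry uses localhost:54310). -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```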
5.3.2 Data Storage and Replication
The Hadoop filesystem is designed for storing petabytes of data with streaming data access, built on the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. HDFS stores metadata on a dedicated server called the NameNode. Application data is stored on other servers called DataNodes. All the servers are fully connected and communicate with each other using TCP-based protocols.
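Replication and block size for such a cluster are set in hdfs-site.xml (the project's actual file appears in the appendix). A minimal sketch, with values assumed from the replication factor of three and 64 MB block size stated earlier in this report:

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml sketch: replication factor of three (one copy per
     rack, as in section 4.2.1.1) and a 64 MB block size. These values
     are assumptions; the actual file appears in the appendix. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB in bytes -->
  </property>
</configuration>
```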
5.3.2.1 Architecture
HDFS is based on a master/slave architecture. An HDFS cluster consists of a single NameNode (the master) and a number of DataNodes (the slaves). The NameNode and DataNodes are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). The use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment one machine usually runs one DataNode. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

5.3.2.2 NameNode
The NameNode manages the filesystem namespace: the metadata for all the files and directories in the tree. A file is divided into large blocks (typically 128 megabytes, but selectable file-by-file) and each block is independently replicated at multiple DataNodes (typically three, but selectable file-by-file) to provide reliability. The NameNode maintains and stores the namespace tree and the mapping of file blocks to DataNodes persistently on the local disk in the form of two files: the namespace image and the edit log. The NameNode also knows the DataNodes on which all the blocks for a given file are located. However, it does not store block locations persistently, since this information is reconstructed from the DataNodes when the system starts. [14] On a NameNode failure, the filesystem becomes inaccessible, because only the NameNode knows how to reconstruct the files from the blocks on the DataNodes. For this reason it is important to make the NameNode resilient to failure, and Hadoop provides two mechanisms for this: the Checkpoint Node and the Backup Node.
5.3.2.3 HDFS Client

Reading a File

To read a file, the HDFS client first contacts the NameNode, which returns a list of addresses of the DataNodes that have a copy of the blocks of the file. The client then connects directly to the closest DataNode for each block and requests the transfer of the desired block. Figure 7 shows the main sequence of events involved in reading data from HDFS.
Writing to a File
To write to a file, the HDFS client first creates an empty file without any blocks. File creation is only possible when the client has write permission and the file does not already exist in the system. The NameNode records the new file creation and allocates data blocks to a list of suitable DataNodes to host replicas of the first block of the file. Replication of data arranges the DataNodes in a pipeline. When the first block is filled, new DataNodes are requested to host replicas of the next block; a new pipeline is organized, and the client sends the further bytes of the file. The choice of DataNodes is likely to be different for each block. If a DataNode in the pipeline fails while the data is being written, the pipeline is first closed, the partial block on the failed DataNode is deleted, and the failed DataNode is removed from the pipeline. New DataNodes are then chosen for the pipeline to write the remaining blocks of data.

5.3.2.4 Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

5.3.2.5 Replica Placement
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks. The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure, so this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not distribute evenly across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

5.3.2.6 Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

5.3.3 Approach

5.3.3.1 Collaborative Filtering

Item-based collaborative filtering is a model-based algorithm for making recommendations. In the algorithm, the similarities between different items in the dataset are calculated using one of a number of similarity measures, and these similarity values are then used to predict ratings for user-item pairs not present in the dataset.

Similarities between items

The similarity values between items are measured by observing all the users who have rated both items. As shown in the diagram below, the similarity between two items depends on the ratings given to the items by users who have rated both of them:
Figure 5.4: Collaborative Filtering
Similarity measures

There are a number of different mathematical formulations that can be used to calculate the similarity between two items. In each formula, the terms are summed over the set U of users who have rated both items.

Cosine-based similarity

Also known as vector-based similarity, this formulation views two items and their ratings as vectors, and defines the similarity between them as the cosine of the angle between these vectors:

sim(X, Y) = (sum over u in U of x_u * y_u) / (sqrt(sum over u in U of x_u^2) * sqrt(sum over u in U of y_u^2))

where x_u and y_u are the ratings given by user u to items X and Y respectively.
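As a sketch (outside any framework), the cosine formulation can be implemented directly over two items' rating vectors:

```java
// Sketch of cosine-based similarity between two items' rating vectors
// over their common users; a plain-Java illustration of the formula above.
public class CosineSimilarity {

    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        double den = Math.sqrt(normX) * Math.sqrt(normY);
        return den == 0 ? 0 : dot / den;
    }

    public static void main(String[] args) {
        // Parallel rating vectors give similarity 1.0 regardless of magnitude.
        System.out.println(cosine(new double[]{4, 2}, new double[]{2, 1}));
    }
}
```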
Our implementation

We implemented item-based collaborative filtering with these parameters:

Similarity measure: adjusted cosine-based similarity
Minimum number of common users for each item-item pair: 5
Number of similar items stored per item: 50

Challenges

We tried item-based collaborative filtering on the MovieLens dataset, but as the results page shows, it did not perform very well in testing. In particular, we isolated two main problems, both mainly due to the sparsity of the data.

The first problem manifested itself during the adjusted-cosine similarity calculation, in the case when there was only one common user between two movies. Since we subtract the average rating for the user, the adjusted-cosine similarity for items with only one common user is 1, which is the highest possible value. As a result, for such items, which are common in the MovieLens database, the most similar items end up being only these items with one common user. The solution we implemented was to specify a minimum number of users (in this case, 5) that two movies needed to have in common before they could be called similar.

The second challenge arose when we used a weighted sum to calculate the rating for test user-movie pairs. Since we store only 50 similar movies for each movie, and for each target movie we only consider the similar movies that the active user has seen, it was often the case with the MovieLens dataset that there weren't many such movies for many of the users. This resulted in bad predictions overall for large test sets. Because this was due to the sparsity of the dataset itself, we couldn't come up with a straightforward solution to this problem.
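The degenerate one-common-user case, and the minimum-common-users guard that fixes it, can be illustrated with a small plain-Java sketch; the method names and the NaN convention for rejected pairs are illustrative, not taken from the project's code.

```java
// Sketch of adjusted-cosine similarity with the minimum-common-users
// guard described above. With a single common user the value is always
// +/-1, so pairs with fewer than minCommon users are rejected (NaN).
public class AdjustedCosine {

    // ratingsI[u], ratingsJ[u]: ratings of items i and j by common user u;
    // userMeans[u]: that user's average rating over all items they rated.
    static double similarity(double[] ratingsI, double[] ratingsJ,
                             double[] userMeans, int minCommon) {
        if (ratingsI.length < minCommon) return Double.NaN; // too few common users
        double num = 0, denI = 0, denJ = 0;
        for (int u = 0; u < ratingsI.length; u++) {
            double di = ratingsI[u] - userMeans[u];
            double dj = ratingsJ[u] - userMeans[u];
            num += di * dj;
            denI += di * di;
            denJ += dj * dj;
        }
        double den = Math.sqrt(denI) * Math.sqrt(denJ);
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        // One common user: the similarity degenerates to 1.0 if unguarded...
        double unguarded = similarity(new double[]{5}, new double[]{4},
                                      new double[]{3}, 1);
        // ...but is rejected once a minimum of 5 common users is required.
        double guarded = similarity(new double[]{5}, new double[]{4},
                                    new double[]{3}, 5);
        System.out.println(unguarded + " " + Double.isNaN(guarded));
    }
}
```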
5.3.4 Hadoop MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner [8]. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
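This type flow can be illustrated with the classic word-count example, sketched here in plain Java with a TreeMap standing in for the framework's shuffle-and-sort; a real job would implement Mapper and Reducer classes instead.

```java
import java.util.*;

// Word count as a plain-Java sketch of the MapReduce type flow:
// <offset, line> -> map -> <word, 1> -> reduce -> <word, total count>.
// A TreeMap stands in for the framework's shuffle-and-sort.
public class WordCount {

    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                  // map: each line...
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;        // ...emits <word, 1>
                counts.merge(word, 1, Integer::sum); // reduce: sum per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hadoop map reduce", "hadoop hdfs")));
    }
}
```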
5.3.4.1 Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

5.3.4.2 Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the Closeable.close() method to perform any required cleanup. Reducer has 3 primary phases: shuffle, sort and reduce.

Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map-outputs are being fetched, they are merged.

Reduce

In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
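The map, shuffle/sort and reduce phases described above can be simulated in miniature with plain Python, using word counting as the classic example. This mirrors the phases only conceptually; real Hadoop distributes them across a cluster:

```python
# In-memory simulation of the map -> shuffle/sort -> reduce flow.
from collections import defaultdict

def map_phase(records):
    # Mapper: one input record may yield zero or many (key, value) pairs.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # The framework groups intermediate values by key and sorts the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reducer: collapse each key's list of values to a smaller set (a sum here).
    for key, values in grouped:
        yield (key, sum(values))

result = dict(reduce_phase(shuffle_and_sort(
    map_phase(["hadoop map reduce", "map reduce"]))))
```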
5.3.4.3 Partitioner
Partitioner partitions the key space. It controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.
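The default hash-based rule can be sketched as follows; Hadoop's HashPartitioner does the analogous computation in Java, so this Python version is illustrative only:

```python
# Partition = hash of the key, kept non-negative, modulo the number of
# reduce tasks. Every record with the same key lands on the same reducer.
def partition(key, num_reduce_tasks):
    # Masking with 0x7FFFFFFF keeps the value non-negative, analogous to
    # Hadoop's (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks
```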
5.3.4.4 Output Collector
Output Collector is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).
5.3.4.5 Job Configuration
JobConf represents a MapReduce job configuration. It is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf; however, while some job parameters are straightforward to set (e.g. setNumReduceTasks(int)), other parameters interact subtly with the rest of the framework and/or job configuration, and it is more complex to set the Partitioner, Reducer, InputFormat, OutputFormat and OutputCommitter implementations. JobConf also indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path) and setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
5.4 Test Plan

1. HDFS Cluster (Test ID: T-CLU1.1)
Description: Checking the connected nodes of the cluster and their arrangement on the HDFS GUI.
Test Input: Dead node (disconnection from network), unresponsive node, slow connection or processing.
Desired Result: Live and unresponsive nodes correctly reported, and the HDFS cluster arrangement shown.
Test Result: Cluster up and running. (Ref: A.1, A.2)

2. HDFS (Test ID: T-HDF1.1)
Description: Check storage of data in an HDFS cluster node.
Test Input: Replication factor 1; replication factor 3 (default); unresponsive network; data nodes full.
Desired Result: With replication factor 1, a single copy of the data stored on the closest empty data node; with replication factor 3, three copies stored in accordance with rack awareness; HDFS write error shown on the GUI for an unresponsive data node; alert for adding new data nodes when existing data nodes are full (scalability feature).
Test Result: Replication factor checked and maintained. (Ref: A.3)

3. HDFS (Test ID: T-HDFS1.2)
Description: Updating data; update user data on the name node.
Test Input: Replace file with updated user data.
Desired Result: Update successful.
Test Result: Data successfully updated. (Ref: A.4)

4. MapReduce Package (Test ID: T-MR1.1)
Description: Test the program with data sets.
Test Input: Data sets.
Desired Result: Output in key-value pair form on a successful test; output format/result error on an unsuccessful test.
Test Result: Data sets tested and key-value pairs created. (Ref: A.5)

5. MapReduce Package (Test ID: T-MR1.2)
Description: MapReduce for correlation.
Test Input: Data set (successful test); corrupt package (unsuccessful test).
Desired Result: Correlation produced within -1 to 1 on a successful test; undefined (NaN) output on an unsuccessful test.
Test Result: Correlation values created. (Ref: A.6)

6. MapReduce Package (Test ID: T-MR1.3)
Description: MapReduce to sort and format output.
Test Input: Data set (successful test); corrupt package (unsuccessful test).
Desired Result: Sorted and desired output produced on a successful test; output format/result error on an unsuccessful test.
Test Result: Sorted output obtained, ordered by correlation value. (Ref: A.7)

7. Algorithm Test (Test ID: T-ALG)
Description: Pen and paper calculation: pick a number of random users and items, predict the rating using the algorithm, and calculate the RMSE between the prediction and the actual rating.
Desired Result: The lower the RMSE value, the better.
Test Result: Low RMSE, better result obtained. (Ref: A.7)

Table 5.1: Test Plan
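The RMSE figure used in the algorithm test (T-ALG) can be computed as a plain sketch; names are illustrative:

```python
# Root-mean-square error between predicted and actual ratings: lower is better.
from math import sqrt

def rmse(predicted, actual):
    assert len(predicted) == len(actual) and predicted
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                / len(predicted))
```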
6. Implementation

6.1 Implementation Details

6.1.1 Cluster Configuration
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system appropriate for your operating system. It is important to divide up the hardware into functions. Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load. The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves. Administrators should use the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons' process environment. To start a Hadoop cluster you will need to start both the HDFS and YARN clusters.
6.1.2 Data Storage

6.1.2.1 Storage of Data in HDFS
Format the configured HDFS file system: open the namenode (HDFS server) and execute the following command.

$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.

$ start-dfs.sh

Listing Files in HDFS

After loading the information into the server, we can find the list of files in a directory, or the status of a file, using 'ls'. Given below is the syntax of ls; you can pass it a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Create an input directory:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Transfer and store the data file from the local system to the Hadoop file system using the put command:

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration of retrieving the required file from the Hadoop file system using the get command:

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
6.1.2.2 Determination of Block Size of the Data Set & Block Replication Factor
An HDFS file is chopped up into 64 MB blocks; with the replication factor of 3, each block is stored as three copies and, if possible, each copy resides on a different DataNode.
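A back-of-the-envelope sketch of this layout (illustrative only): a file is split into ceil(size / 64 MB) logical blocks, and each block is stored three times across DataNodes.

```python
# Logical blocks and physical copies for a file of a given size in MB,
# using the block size and replication factor stated above.
BLOCK_SIZE_MB = 64
REPLICATION = 3

def block_layout(file_size_mb):
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return blocks, blocks * REPLICATION  # (logical blocks, physical copies)
```

For example, a 130 MB file occupies 3 blocks (64 + 64 + 2 MB) and 9 physical block copies across the cluster.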
6.1.3 Analysis of Data And Recommendation
Our goal is to calculate how similar pairs of movies are, so that we can recommend movies similar to the movies a user liked. Using correlation we can: for every pair of movies A and B, find all the people who rated both A and B; use these ratings to form a Movie A vector and a Movie B vector; and then calculate the correlation between these two vectors. Now when someone watches a movie, we can recommend the movies most correlated with it.
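As a hedged illustration of the vector step just described (not the project's MapReduce code), the correlation between two such co-rated vectors is the Pearson coefficient:

```python
# Pearson correlation between two equal-length rating vectors built from
# the users who rated both movies. Returns NaN when either vector is constant.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sqrt(sum((x - mean_x) ** 2 for x in xs))
           * sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return num / den if den else float("nan")
```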
6.1.3.1 Input Format
InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to: validate the input specification of the job; split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper; and provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
6.1.3.2 MapReduce to Customize Dataset
The first step is to get our movies file, which has three columns: (user, movie, rating). For this task we use the MovieLens dataset. We use MapReduce to convert the dataset into the required format: the mapper emits, for each user, a row containing their 'postings' (item, rating), and the reducer emits the user's rating sum and count for use in later steps.

6.1.3.3 MapReduce to Perform Correlation
Each row in the similarity calculation computes the number of people who rated both movie1 and movie2, the sum over all elements in each ratings vector (sum_x, sum_y) and the squared sum of each vector (sum_xx, sum_yy). From these we can calculate the correlation between the movies.

6.1.3.4 MapReduce to Perform Sort
The last step of the job sorts the top-correlated items for each item and prints them to the output.

6.1.3.5 Output Collector
It collects the required output from the result dataset of the sort MapReduce program and displays it to the user.
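The three stages described above (customize, correlate, sort) can be sketched end-to-end in plain Python. The sums mirror sum_x, sum_y, sum_xx, sum_yy plus sum_xy and the co-rater count n; the toy data and names are illustrative, not the project's actual Hadoop jobs:

```python
# In-memory sketch of the three-stage pipeline over (user, movie, rating) rows.
from collections import defaultdict
from itertools import combinations
from math import sqrt

rows = [("u1", "A", 4), ("u1", "B", 5), ("u1", "C", 1),
        ("u2", "A", 3), ("u2", "B", 4), ("u2", "C", 2),
        ("u3", "A", 5), ("u3", "B", 5)]

# Stage 1 (customize): group each user's postings (item, rating).
postings = defaultdict(list)
for user, movie, rating in rows:
    postings[user].append((movie, rating))

# Stage 2 (correlate): accumulate n, sum_x, sum_y, sum_xx, sum_yy, sum_xy for
# every movie pair co-rated by a user, then turn the sums into Pearson r.
sums = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])
for items in postings.values():
    for (m1, x), (m2, y) in combinations(sorted(items), 2):
        s = sums[(m1, m2)]
        s[0] += 1; s[1] += x; s[2] += y
        s[3] += x * x; s[4] += y * y; s[5] += x * y

def corr(n, sx, sy, sxx, syy, sxy):
    num = n * sxy - sx * sy
    den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
    return num / den if den else float("nan")

correlations = {pair: corr(*s) for pair, s in sums.items()}

# Stage 3 (sort): for each movie, its partners ordered by correlation, best first.
top = defaultdict(list)
for (m1, m2), c in correlations.items():
    top[m1].append((c, m2))
    top[m2].append((c, m1))
for movie in top:
    top[movie].sort(reverse=True)
```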
6.2 System Installation Steps
Before installing Hadoop into the Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps given below for setting up the Linux environment.

Installing Java

Java is the main prerequisite for Hadoop. First of all, verify the existence of Java in your system using the following command.

$ java -version

Step 1: Download Java (JDK - X64.tar.gz), then verify and extract jdk7u71-linux-x64.gz.
Step 2: To make Java available to all users, move it to the location "/usr/local/".
Step 3: For setting up the PATH and JAVA_HOME variables, add the commands to the ~/.bashrc file.
Step 4: Configure Java alternatives and verify the java -version command from the terminal as explained above.

Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now apply all the changes to the currently running system.

$ source ~/.bashrc
Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java in your system:

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of the replication factor, the namenode path, and the datanode paths on your local file systems, i.e. the place where you want to store the Hadoop infrastructure.
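As a hedged illustration (the report's actual files appear only as screenshots in Appendix A.1 and A.3), minimal versions of these two files for a single-node, Hadoop 1.x-style setup might look like the following. The port 54310 matches the address visible in Appendix A.2; the storage paths are assumptions:

```xml
<!-- core-site.xml: default filesystem URI (port 54310 as in Appendix A.2) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml: replication factor (3, as in the test plan) and
     assumed local storage paths for namenode and datanode data -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/hdfs/datanode</value>
  </property>
</configuration>
```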
6.3 System Usage Instructions
After the installation of Apache Hadoop on the Master and Slave nodes, the following steps have to be performed in order to run the jar file.

1. Data Storage in HDFS
The data that has to be processed initially exists in the local file system. It has to be transferred to HDFS in order for the jar to operate upon it. This is done by typing the following into the terminal:

$ hadoop fs -put <local-file> <hdfs-input-dir>

2. JAR Execution
The JAR is our executable file which operates upon the dataset that we transferred to HDFS. It is run by typing the following into the terminal:

$ hadoop jar <jar-file> <input-path> <output-path>

3. Output Retrieval
The output produced by a MapReduce program is stored within the directory which has been set as the output folder. Within it, the final output of the MapReduce is kept. If the job is map-only, the file containing the output is named part-m-00000; if the output is produced after the reduce, the file is named part-r-00000. The output can be retrieved in either of the following two ways:

a. Using the terminal. Since the location of the output file is known, it can be displayed in the terminal itself with the following command:

$ hadoop fs -cat <output-file-path>

b. Using the HDFS web-based file browser. Hadoop comes with a web-based file browser, which provides a GUI to browse the HDFS. So, if the file location is known, the destination can be reached by navigating in the browser. The URL for the browser is:

http://localhost:50070/explorer.html#/
7. Conclusion

7.1 Project Benefits
Recommender systems are a powerful technology for extracting additional value for a business from its user databases. These systems help users find items they want to buy, and conversely help the business by generating more sales. Recommender systems are rapidly becoming a crucial tool in e-commerce on the Web. They are being stressed by the huge volume of user data in existing corporate databases, and will be stressed even more by the increasing volume of user data available on the Web. New technologies are needed that can dramatically improve the scalability of recommender systems. In our project we used collaborative filtering, which promises to process large data sets and at the same time produce high-quality recommendations.
The system implemented in the project uses static data to recommend movies to the users. To incorporate dynamic data, distributed databases such as HBase or Cassandra can be used, which can be regularly updated to add new users and ratings. To build a web application, the data needs to be accessible in real time; the solution to this, too, can be the use of a distributed database. The recommender system can be improved by combining user-based collaborative filtering and content-based filtering with the current system. This combination is also called hybrid filtering, and it can significantly improve performance. The comparison made between the different similarity metrics was based on run time and not on the precision of the recommendations.
8. References
[1] A. Felfernig, G. Friedrich and L. Schmidt-Thieme, "Recommender Systems", IEEE Intelligent Systems, pp. 18-21, 2007.
[2] P. Resnick, N. Iacovou, M. Suchak and J. Riedl, "GroupLens: An Open Architecture for Collaborative Filtering of Netnews", in Proceedings of CSCW '94, Chapel Hill, NC, 1994.
[3] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating 'Word of Mouth'", in Proceedings of CHI '95, Denver, 1995.
[4] Daniar Asanov, "Algorithms and Methods in Recommender Systems", Berlin Institute of Technology, Berlin, 2011.
[5] Badrul Sarwar, George Karypis and John Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms", WWW10 (IW3C2), Hong Kong, China, 2001. [Online]. Available: http://wwwconference.org/www10/cdrom/papers/519/index.html
[6] Francesco Ricci, Lior Rokach and Bracha Shapira, "Introduction to Recommender Systems Handbook", New York: Springer Science+Business Media, 2011, ch. 1, sec. 1.4, pp. 10-14.
[7] D. Borthakur, "HDFS Architecture Guide", Hadoop Apache Project, 2008. [Online]. Available: http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf
[8] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[9] Fuzhi Zhang, Huilin Liu and Jinbo Chao, "A Two-stage Recommendation Algorithm Based on K-means Clustering in Mobile E-commerce", Journal of Computational Information Systems, vol. 6, issue 10, pp. 3327-3334, 2010.
[10] Taek-Hun Kim, Young-Suk Ryu, Seok-In Park and Sung-Bong Yang, "An Improved Recommendation Algorithm in Collaborative Filtering", Department of Computer Science, Yonsei University.
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia and Robert Chansler, "The Hadoop Distributed File System", IEEE, 978-1-4244-7153-9/10, 2010.
[12] Emmanouil Vozalis and Konstantinos G. Margaritis, "Analysis of Recommender Systems' Algorithms", IEEE conference proceedings.
[13] Brian McFee, Luke Barrington and Gert Lanckriet, "Learning Content Similarity for Music Recommendation", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, 2012.
[14] Paul C. Zikopoulos and Chris Eaton, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data", McGraw-Hill, 2011.
[15] Chuck Lam, "Hadoop in Action", Manning Publications, 2010.
APPENDIX

A.1 Core-site.xml
A.2 localhost:54310
A.3 Hdfs-site.xml
A.4 U.data
A.5 MapReduce 1
A.6 MapReduce 2