International Journal of Computational Intelligence and Information Security, August 2011 Vol. 2, No. 8
Clustering Analysis & Study on Deriving Concept-Based User Profiles from Search Engine Logs

NagaValli Susarla, M.Tech, Varadhaman College of Engineering. Email: [email protected]
B. Deepthi, M.Tech, Jaya Prakesh Narayan College of Engineering
E.R. Aruna, Assoc. Prof., Varadhaman College of Engineering
B.V. Rama Krishna, Assoc. Prof., Varadhaman College of Engineering
Abstract— User profiling is a fundamental component of any personalization application. Most existing user profiling strategies are based on objects that users are interested in (i.e., positive preferences), but not on objects that users dislike (i.e., negative preferences). In this paper, we focus on search engine personalization and implement several concept-based user profiling methods that are based on both positive and negative preferences. We analyze the proposed methods against our previously proposed personalized query clustering method. Our results show that profiles which capture and utilize both the user's positive and negative preferences perform best. An important result of our research is that profiles with negative preferences can increase the separation between similar and dissimilar queries. This separation provides a clear threshold for an agglomerative clustering algorithm to terminate and improves the overall quality of the resulting query clusters. Our analysis shows that the proposed methods are more effective than existing ones.
Keywords – Search Engine, User Profile, Clustering, Positive & Negative Preferences
I. INTRODUCTION
The information explosion on the Internet has placed high demands on search engines. People are far from satisfied with the performance of existing search engines, which often return thousands of documents in response to a user query, many of them irrelevant to the user's need. The precision of current search engines is well under people's expectations. To find more precise answers to a query, a new generation of search engines, question answering systems, has appeared on the web (e.g., http://www.askgoogle.com/). Unlike traditional search engines that only use keywords to match documents, this new generation of systems tries to understand the user's question and suggest similar questions that other people have often asked and for which the system has correct answers. In fact, the correct answers have in most cases been prepared or checked by human editors, which guarantees that if one of the suggested questions is truly similar to that of the user, the answers provided by the system will be relevant. The assumption behind such a system is that many people are interested in the same questions, the frequently asked ones. A user profiling strategy is an essential and fundamental component of search engine personalization. Most personalization methods focus on the creation of one single profile for a user and apply the same profile to all of the user's queries. We believe that different queries from a user should be handled differently, because a user's preferences may vary across queries. For example, a user who prefers information about fruit for the query "orange" may prefer information about Apple Computer for the query "apple." Personalization strategies such as [1], [2] employed a single large user profile for each user in the personalization process.
To improve user’s search experience, most major commercial search engines provide query suggestions to help users formulate more effective queries. When a user submits a query, a list of terms that are semantically related to the submitted query is provided to help the user identify terms that he/she really wants, hence improving the retrieval effectiveness. Yahoo’s “Also Try” [6] and Google’s “Searches related to” features provide related queries for narrowing search, while Ask Jeeves [1] suggests both more specific and more general queries to the user. Unfortunately, these systems provide the same suggestions to the same query without considering users’ specific interests.
SECTION II
2.1. What Are Search Engines?
The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. Crawler-based search engines, such as Google, create their listings automatically. They crawl the web, and then people search through what they have found. If there is any change on the web pages, crawler-based search engines eventually find these changes, which can affect the listings. Some search engines also mine data available in news, books, databases, or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or by a mixture of algorithmic and human input.
2.2. How Do Search Engines Work?
Crawler-based search engines [7] have three major elements. The first is the spider, or crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes. Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, this book is updated with new information. Sometimes it can take a while for new pages or changes that the spider finds to be added to the index; thus, a web page may have been "spidered" but not yet indexed. Until it is indexed, that is, added to the index, it is not available to those searching with the search engine. The search engine software is the third part of a search engine.
This is the program that searches through the web pages recorded in the index to find matches to a search and then ranks them in order of what it believes is most relevant.
2.3. How Do Search Engines Rank Web Pages?
To search for anything, you use a crawler-based search engine. Nearly instantly, the search engine sorts through the millions of pages it knows about and presents you with the ones that match your topic [10]. The matches are even ranked, so that the most relevant ones come first. Unfortunately, search engines [2] do not have the ability to focus the search, and they cannot rely on judgment and past experience to rank web pages the way humans can. So, to determine relevancy, they follow a set of rules known as an algorithm. Exactly how a particular search engine's algorithm works is a closely kept trade secret; however, all major search engines follow the general rules mentioned below.
SECTION III
3. Hierarchical Clustering
Hierarchical models [8], [9] offer better versatility, as they do not require any a priori definition of the number of clusters to find. They are deterministic in nature, produce the same clustering each time, and provide an excellent visualization in the form of a dendrogram. In hierarchical clustering, the data are not partitioned into particular clusters in a single step. The rules [4] that govern between which points distances are measured to determine cluster membership include:
Complete-link (or complete linkage): merge in each step the two clusters whose merger has the smallest diameter (i.e., the two clusters with the smallest maximum pairwise distance).
dist(Ci, Cj) = max { dist(oi, oj) | oi ∈ Ci, oj ∈ Cj }
Single-link (or single linkage): merge in each step the two clusters whose two closest members have the smallest distance (i.e., the two clusters with the smallest minimum pairwise distance).
dist(Ci, Cj) = min { dist(oi, oj) | oi ∈ Ci, oj ∈ Cj }
Average-link: merge in each iteration the pair of clusters with the smallest average pairwise distance.
dist(Ci, Cj) = mean { dist(oi, oj) | oi ∈ Ci, oj ∈ Cj }
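The three linkage rules above can be sketched directly from their definitions. The point data and the Euclidean metric below are illustrative assumptions, not from the paper:

```python
# Minimal sketch of the complete-, single-, and average-link rules.
from math import dist  # Euclidean distance (Python 3.8+)

def complete_link(ci, cj):
    # Largest pairwise distance between the two clusters.
    return max(dist(oi, oj) for oi in ci for oj in cj)

def single_link(ci, cj):
    # Smallest pairwise distance between the two clusters.
    return min(dist(oi, oj) for oi in ci for oj in cj)

def average_link(ci, cj):
    # Mean of all pairwise distances between the two clusters.
    return sum(dist(oi, oj) for oi in ci for oj in cj) / (len(ci) * len(cj))

ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(ci, cj))    # 3.0 (closest pair)
print(complete_link(ci, cj))  # 6.0 (farthest pair)
print(average_link(ci, cj))   # 4.5 (mean of 4, 6, 3, 5)
```

An agglomerative algorithm plugs any one of these functions in as its cluster-distance measure; only the merge order changes.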
Figure 1: Rules for membership
Ward's method [8]: This method is distinct from all the others because it uses an analysis-of-variance approach to evaluate the distances between clusters. In short, it attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step. In general, this method is regarded as very efficient; however, it tends to create clusters of small size. Given a set of N items to be clustered and an N*N distance (or similarity) matrix, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object; hierarchical clustering is based on the union between the two nearest clusters. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects
successively into finer groupings. For both hierarchical methods, a tree-like structure known as a dendrogram (or tree graph) is constructed; a dendrogram for two clusters is shown in Figure 2 below.
Figure 2: Dendrogram
Hierarchical algorithms like AGNES, DIANA, and MONA construct a hierarchy of clusters, with the number of clusters ranging from one to the number of observations. The space complexity of a hierarchical algorithm is O(n^2), the space required for the adjacency matrix. The space required for the dendrogram is O(kn), which is much less than O(n^2). The time complexity of hierarchical algorithms is O(kn^2), because there is one iteration for each level in the dendrogram. Depending on the specific algorithm, this could actually be O(maxd * n^2), where maxd is the maximum distance between points. Different algorithms may merge the closest clusters from the next lowest level or create new clusters at each level with progressively larger distances.
3.1. Agglomerative Algorithm
Agglomerative algorithms [9], [8] start with each item in its own cluster and iteratively merge clusters until all items belong to one cluster. All agglomerative approaches suffer from excessive time and space requirements: the space needed for the adjacency matrix is O(n^2), where there are n items to cluster. Agglomerative approaches are also not incremental; when new elements are added or changed, the entire algorithm must be rerun.
Agglomerative Nesting (AGNES) [8]: The algorithm constructs a tree-like hierarchy which implicitly contains all values of k, starting with n clusters and proceeding by successive fusions until a single cluster containing all the objects is obtained. It initializes n sub-clusters with one vector each and computes the triangular distance matrix; this is repeated n times. The smallest distance between sub-clusters i and j is identified, the sub-clusters i and j are merged into a new sub-cluster (i, j), the new sub-cluster replaces them, and the distance matrix is updated, which results in a hierarchical structure as shown in
Figure 3: Sub-cluster merging
The sub-clusters are merged by deleting the rows and columns for i and j and adding a row for (i, j) with new distances. The distances can be computed by single, complete, or average linkage.
3.2. Divisive Analysis (DIANA): With divisive clustering, all items are initially placed in one cluster, and clusters are repeatedly split in two until all items are in their own cluster. The idea is to split up clusters in which some elements are not sufficiently close to the other elements. DIANA can be used for the same type of data as AGNES. Although AGNES and DIANA produce similar output, DIANA constructs its hierarchy in the opposite direction, starting with one large cluster containing all objects. At each step, it splits a cluster into two smaller ones, until all clusters contain only a single element. This means that for N objects, the hierarchy is built in N-1 steps. In the first step, the data are split into two clusters by making use of dissimilarities. In each subsequent step, the cluster with the largest diameter is split in the same way. After N-1 divisive steps, all objects are apart.
3.3. Monothetic Analysis (MONA): Most other hierarchical methods, including AGNES and DIANA, are polythetic. MONA operates on a data matrix with binary variables; for each split, MONA [8] uses a single variable, which is why it is called a monothetic method. The MONA algorithm constructs a hierarchy of clusterings, starting with one large cluster. Clusters are divided until all observations in the same cluster have identical values for all variables. At each stage, every cluster is divided according to the values of one variable: a cluster is divided into one cluster with all observations having value 1 for that variable and another cluster with all observations having value 0 for that variable, as shown in Figure 4.
The variable used for splitting a cluster is the variable with the maximal total association to the other variables, computed over the observations in the cluster to be split.
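One monothetic split of the kind described above can be sketched as follows. The paper does not specify the association measure; the |ad - bc| statistic from the 2x2 contingency table, a common choice for MONA, is an assumption here, as is the toy binary data matrix:

```python
# Minimal sketch of one MONA-style split on binary data.

def association(data, v, w):
    # 2x2 contingency counts of binary variables v and w over the rows.
    a = sum(1 for row in data if row[v] == 1 and row[w] == 1)
    b = sum(1 for row in data if row[v] == 1 and row[w] == 0)
    c = sum(1 for row in data if row[v] == 0 and row[w] == 1)
    d = sum(1 for row in data if row[v] == 0 and row[w] == 0)
    return abs(a * d - b * c)

def mona_split(data):
    # Pick the variable with maximal total association to the others,
    # then divide the cluster into its value-1 and value-0 observations.
    n_vars = len(data[0])
    best = max(range(n_vars),
               key=lambda v: sum(association(data, v, w)
                                 for w in range(n_vars) if w != v))
    ones = [row for row in data if row[best] == 1]
    zeros = [row for row in data if row[best] == 0]
    return best, ones, zeros

data = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
var, ones, zeros = mona_split(data)
print(var, ones, zeros)
```

Applying the split recursively to `ones` and `zeros` until every cluster is constant in all variables yields the full MONA hierarchy.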
Figure 4: MONA with binary variables
The hierarchical algorithms presented above are compared [4] in Table 1 below on the basis of time and space complexity.
Table 1: Complexity comparison of hierarchical algorithms
3.4. CURE (Clustering Using Representatives): CURE adopts an agglomerative scheme and shares some of its motivation with the BIRCH [10], [11] algorithm. It is an improvement over the single-link clustering algorithm. CURE carefully selects several scattered points as representatives for each cluster and shrinks these representatives towards the centroid in order to eliminate the effects of outliers and the chaining effect. The distance between two clusters in CURE is defined as the minimum distance between the representatives of each cluster, and in each iteration the closest clusters are merged. The set of representatives for each sub-cluster is computed as follows: begin with the centroid of the new cluster and find the data object farthest from the centroid; this is the first representative point. In subsequent iterations, the point in the sub-cluster farthest from the previously chosen representatives is chosen. The points are then shrunk towards the centroid by a fraction α; this shrinking overcomes the problem of outliers. CURE is one of the most successful methods for clustering data of any shape; however, it is computationally intensive, with time complexity O(n^2 log n) and space complexity O(n). Performance experiments compared CURE to BIRCH and found the quality of the clusters produced by CURE to be better; results with large datasets indicate that CURE scales well and outperforms BIRCH.
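The representative selection and shrinking step of CURE can be sketched as below. The number of representatives `c`, the shrink fraction `alpha`, and the sample cluster are illustrative assumptions; a full CURE implementation would embed this step in the agglomerative merge loop:

```python
# Minimal sketch of CURE's scattered-representative selection and shrinking.
from math import dist  # Euclidean distance (Python 3.8+)

def cure_representatives(cluster, c=3, alpha=0.5):
    n = len(cluster)
    centroid = tuple(sum(p[i] for p in cluster) / n
                     for i in range(len(cluster[0])))
    reps = []
    for _ in range(min(c, n)):
        if not reps:
            # First representative: the point farthest from the centroid.
            point = max(cluster, key=lambda p: dist(p, centroid))
        else:
            # Next: the point farthest from its nearest chosen representative,
            # so representatives stay well scattered over the cluster.
            point = max((p for p in cluster if p not in reps),
                        key=lambda p: min(dist(p, r) for r in reps))
        reps.append(point)
    # Shrink each representative towards the centroid by fraction alpha
    # to dampen the influence of outliers and the chaining effect.
    return [tuple(r[i] + alpha * (centroid[i] - r[i]) for i in range(len(r)))
            for r in reps]

reps = cure_representatives([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)], c=2, alpha=0.5)
print(reps)  # [(1.0, 0.0), (3.0, 0.0)]
```

The cluster-to-cluster distance used for merging is then the minimum distance between one cluster's shrunken representatives and the other's.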
SECTION IV
4. User profiling is a fundamental component of any personalization application. Most existing user profiling strategies are based on objects that users are interested in (i.e., positive preferences), but not on objects that users dislike (i.e., negative preferences). In this paper, we focus on search engine personalization and develop several concept-based user profiling methods that are based on both positive and negative preferences.
4.1. Personalized Search: Personalized search learns the user's preferences through content-based and behavior-based methods, such as category interests, click-through data, and browsing history. The search results are then adjusted by filtering them through the user profile and re-ranking the retrievals from the search engine. Why is clustering required when searching by query? If similar queries express the same information need, the click-through data can be analyzed and clustered using the method of Wen et al. (2001).
Figure 5 shows the flow of query processing in a user-profile-based search engine. The clustering algorithm is based on a similarity function:
Similarity = a * Sim_keyword + b * Sim_feedback
The keyword similarity is computed from the query terms typed by the user, while the feedback similarity is computed from the set of documents clicked for the query. The result of clustering is a global map, which is a set of document pairs.
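The combined similarity can be sketched as below. The component measures (term overlap for the keyword part, clicked-document overlap for the feedback part) and the weights a, b follow the general scheme of Wen et al., but the exact definitions here are illustrative assumptions:

```python
# Minimal sketch of Similarity = a * Sim_keyword + b * Sim_feedback.

def overlap(x, y):
    # Symmetric set overlap in [0, 1] (assumed component measure).
    if not x or not y:
        return 0.0
    return len(x & y) / max(len(x), len(y))

def query_similarity(terms1, clicks1, terms2, clicks2, a=0.5, b=0.5):
    # Weighted combination of keyword and click-through similarity.
    return a * overlap(terms1, terms2) + b * overlap(clicks1, clicks2)

sim = query_similarity({"apple", "fruit"}, {"d1", "d2"},
                       {"apple", "juice"}, {"d2", "d3"})
print(sim)  # 0.5: keyword overlap 1/2, clicked-document overlap 1/2
```

Queries whose combined similarity exceeds a threshold are merged into the same cluster, producing the global map of document pairs described above.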
Figure 6 shows the global set of documents.
4.2. How to Create a Group Profile
First, choose N user profiles and pick N^(1/2) of them at random as leaders. Compute the similarity of each profile to the leaders and attach it to the nearest leader, adjusting the groups and removing very small ones. A problem with this approach is that it is not incremental, since the initial N user profiles must be chosen in advance. To identify the group for a new user, calculate the distance between the user and each group and pick the nearest one.
4.3. Our Proposed Solution: We address the problem by proposing and studying seven concept-based user profiling strategies that are capable of deriving both the user's positive and negative preferences. All of the user profiling strategies are query-oriented, meaning that a profile is created for each of the user's queries. The strategies are evaluated and compared with our previously proposed personalized query clustering method. Experimental results show that user profiles which capture both the user's positive and negative preferences perform the best among all of the profiling strategies studied. Moreover, we find that negative preferences improve the separation of similar and dissimilar queries, which helps an agglomerative clustering algorithm decide whether the optimal clusters have been obtained. We show by experiments that the termination point and the resulting precision and recall are very close to the optimal results.
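The leader-based grouping of Section 4.2 can be sketched as follows. Representing profiles as term-weight dictionaries and using cosine similarity are illustrative assumptions; the paper only fixes the sqrt(N) leader count and nearest-leader assignment:

```python
# Minimal sketch of leader-based group-profile construction:
# pick sqrt(N) profiles at random as leaders, attach every profile
# to its most similar leader.
import math
import random

def cosine(u, v):
    # Cosine similarity of two sparse term-weight vectors (dicts).
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def group_profiles(profiles, seed=0):
    rng = random.Random(seed)
    k = max(1, math.isqrt(len(profiles)))      # sqrt(N) leaders
    leaders = rng.sample(range(len(profiles)), k)
    groups = {l: [] for l in leaders}
    for i, p in enumerate(profiles):
        # Attach each profile to its most similar leader.
        nearest = max(leaders, key=lambda l: cosine(p, profiles[l]))
        groups[nearest].append(i)
    return groups

profiles = [{"apple": 1.0}, {"apple": 0.9, "fruit": 0.2},
            {"linux": 1.0}, {"linux": 0.8, "kernel": 0.5}]
print(group_profiles(profiles))
```

The non-incremental nature noted in the text is visible here: adding a profile requires re-running the whole grouping, while classifying a new user only requires comparing against the existing leaders.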
SECTION V
5.1. User Profile: A user profile, when used in context, is a collection of personal data associated with a specific user. A profile refers to the explicit digital representation of a person's identity and can also be considered the computer representation of a user-specific model. A profile can be used to store a description of a person's characteristics, and this information can be exploited by systems that take the person's characteristics and preferences into account. For instance, profiles can be used by hypermedia systems that personalize human-computer interaction.
5.2. User Profile Extraction: User profiles are extracted via the query evaluation engine. The task of the Query Evaluation Engine (QE) is to determine the type of the user online and compute recommendations based on that user's recent actions. The decision is based on the knowledge attained from the existing logs and page ranking, as in Fig. 1. No user is attached to a particular outcome, as users are free to move from one locality to another; however, with carefully designed user profile extraction, the recommendation covers the majority of the user session. The pages are ranked using the Vis algorithm (1) [1]:
Fig. 7: User Profile System
In the formula, Marks(i) represents the total marks obtained by the student, Grade value(i) represents the number of hits for a page during Marks(i), and p is a probability value between 0 and 1 that can be predefined. Thus, only the most recent past plays a crucial role in ranking.
5.3. Comparative Study on Clustering: Hierarchical clustering [1] is computationally expensive, works on relatively small data sets, and requires much memory space; although CURE is able to give good clustering results, it is costly in computation overhead. Nice visualizations are possible because of dendrograms. Hierarchical clustering is deterministic in nature but cannot correct early mistakes, since each split or merge leads directly to the next step. A major drawback of existing hierarchical schemes like CURE is that merging decisions are based on a static model of the clusters to be merged. Partitional clustering runs in linear time with better efficiency, but has a tendency to end up in local minima and a dependency on the input order. Its clustering quality is, however, not as good as that of hierarchical algorithms. Partitional algorithms are computationally efficient for large data sets, but because they require a predefined number of clusters, optimization is difficult; they are non-deterministic and should be run several times, though there is always room for iterative improvement. They create clusters in one step, as opposed to the several steps of hierarchical clustering.
CONCLUSION
A user profile can improve a search engine's performance by identifying the information needs of individual users. In this paper, we proposed several user profiling strategies. The techniques make use of click-through data and concepts extracted from web snippets to build concept-based user profiles automatically. We applied preference mining rules to infer not only users' positive preferences but also their negative preferences, and utilized both kinds of preferences in deriving user profiles. The user profiling strategies were evaluated and compared with the personalized query clustering method that we proposed previously.
Our analysis shows that profiles capturing both the user's positive and negative preferences perform best among the user profiling strategies studied. Apart from improving the quality of the resulting clusters, the negative preferences in the proposed user profiles also help to separate similar and dissimilar queries into distant clusters, which helps to determine near-optimal termination points for our clustering algorithm.
References
[1] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas, "Link Analysis Ranking: Algorithms, Theory and Experiments," ACM Transactions on Internet Technology (TOIT) 5(1) (2005) 231-297.
[2] Curt Franklin, "How Internet Search Engines Work," 2002.
[3] Rui Xu and Donald Wunsch II, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks 16(3) (2005).
[4] "Combining Partitional and Hierarchical Algorithms for Robust and Efficient Data Clustering with Cohesion Self-Merging," IEEE Transactions on Knowledge and Data Engineering 17(2) (2005) 145-159.
[5] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data," IEEE Transactions on Knowledge and Data Engineering 13(1) (2001).
[6] Karl-Heinrich Anders, "A Hierarchical Graph-Clustering Approach to Find Groups of Objects," IEEE.
[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann (2000) 21-25.
[8] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data," IEEE Transactions on Knowledge and Data Engineering 13(1) (2001).
[9] Lamia Fhalthu Ibrahim, "Using of Clustering Approach for Rural Network Planning," Third International Conference on InfoTech (2005) 1-5.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann (2000) 21-25.
[11] L. Kaufman and P. J. Rousseeuw, "Clustering by Means of Medoids," North-Holland, Amsterdam (1987) 405-416.
NagaValli Susarla is pursuing her M.Tech at Varadhaman College of Engineering. Her areas of interest include networks, the Semantic Web, and data mining & warehousing; she is currently focusing on clustering.
E.R. Aruna is an Associate Professor at Varadhaman College of Engineering with an M.Tech from Padamavathi Mahila University. Her areas of interest include data mining & warehousing.
B. Deepthi is pursuing her M.Tech at Jaya Prakesh Narayan College of Engineering. Her areas of interest include networks and data mining & warehousing.
B.V. Rama Krishna, Associate Professor, received his M.Tech from Punjab University and is currently the head of department at Ashok Institute of Technology, Hyderabad. His areas of interest include data mining and information security.