Assignment Clustering ROCK Clustering Algorithm
Input:
A set S of data points
The number k of clusters to be found
The similarity threshold θ
Output: Groups of clustered data
The ROCK algorithm is divided into three major parts:
1. Draw a random sample from the data set
2. Perform a hierarchical agglomerative clustering algorithm
3. Label data on disk
In our case, we do not deal with a very large data set, so we consider the whole data set when forming clusters, i.e. we skip step 1 and step 3.
1. Draw a random sample from the data set:
Sampling is used to ensure scalability to very large data sets. The initial sample is used to form clusters, then the remaining data on disk is assigned to these clusters.
In our case, we will consider the whole data set in the process of forming clusters.
2. Perform a hierarchical agglomerative clustering algorithm:
ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms but use a different definition of the similarity measure:
a. Place each single data point into a separate cluster
b. Compute the similarity measure for all pairs of clusters
c. Merge the two clusters with the highest similarity (goodness measure)
d. Verify a stop condition. If it is not met, go to step b
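Steps a to d can be sketched as a generic loop in Python. This is only an illustration: the function name `agglomerate`, the pluggable `goodness` callback, and the toy `mean_gap_goodness` measure on numbers are assumptions made for the example, not part of ROCK itself.

```python
from itertools import combinations


def agglomerate(points, k, goodness):
    """Greedy agglomerative loop: merge the best pair until k clusters remain."""
    # Step a: every point starts in its own cluster.
    clusters = [frozenset([p]) for p in points]
    # Step d: the stop condition is "k clusters reached" (or nothing worth merging).
    while len(clusters) > max(k, 1):
        # Step b: score every pair of clusters.
        scored = [(goodness(a, b), i, j)
                  for (i, a), (j, b) in combinations(enumerate(clusters), 2)]
        best, i, j = max(scored, key=lambda t: t[0])
        if best <= 0:
            break
        # Step c: merge the pair with the highest score.
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters


def mean_gap_goodness(a, b):
    """Toy measure (hypothetical): closer cluster means give a higher score."""
    gap = abs(sum(a) / len(a) - sum(b) / len(b))
    return 1 / (1 + gap)


result = agglomerate([1, 2, 10, 11], 2, mean_gap_goodness)
print(result)
```

ROCK keeps this loop and replaces the toy measure with a goodness measure based on links.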
3. Label data on disk:
Finally, the remaining data points on disk are assigned to the generated clusters.
This is done by selecting a random sample Li from each cluster Ci, then assigning each point p to the cluster for which it has the strongest linkage with Li.
As noted above, we consider the whole data set in the process of forming clusters.
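Although this step is skipped here, a rough Python sketch may help make it concrete. It assumes that "strongest linkage" means the largest number of neighbors of p inside the sample Li (a raw count, with no normalization by sample size), and that neighbors are decided by a Jaccard similarity of at least θ. The function name `label_points`, the `sample_size` and `seed` parameters, and the extra point P5 are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard coefficient of two item sets."""
    return len(a & b) / len(a | b)


def label_points(remaining, clusters, theta, sample_size=2, seed=0):
    """Assign each remaining point to the cluster whose sample holds most of its neighbors."""
    rng = random.Random(seed)
    # Draw a random sample L_i from each cluster C_i.
    samples = [rng.sample(c, min(sample_size, len(c))) for c in clusters]
    labels = {}
    for name, items in remaining.items():
        # Assumption: linkage strength = number of neighbors of the point in L_i.
        counts = [sum(jaccard(items, q) >= theta for q in s) for s in samples]
        labels[name] = counts.index(max(counts))  # ties go to the first cluster
    return labels


clusters = [
    [{"judgment", "faith", "prayer", "fair"},
     {"fasting", "faith", "prayer"},
     {"fair", "fasting", "faith"}],
    [{"fasting", "prayer", "pilgrimage"}],
]
# P5 is a made-up point used only to exercise the function.
labels = label_points({"P5": {"faith", "prayer", "fair"}}, clusters, theta=0.3)
print(labels)
```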
Computation of links:
1. Using the similarity threshold θ, we can convert the similarity matrix into an adjacency matrix (A).
2. Then we obtain a matrix indicating the number of links by calculating (A x A), i.e., by multiplying the adjacency matrix A with itself.
3. Suppose we have four verses that contain some subjects, as follows (the second set on each line is the same point written as market-basket items):
P1 = {judgment, faith, prayer, fair}; {milk, butter, bread, eggs}
P2 = {fasting, faith, prayer}; {honey, butter, bread}
P3 = {fair, fasting, faith}; {eggs, honey, butter}
P4 = {fasting, prayer, pilgrimage}; {honey, bread, jam}
4. The similarity threshold = 0.3, and the number of required clusters is 2.
5. Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table (values computed from the sets above):
      P1    P2    P3    P4
P1   1.00  0.40  0.40  0.17
P2   0.40  1.00  0.50  0.50
P3   0.40  0.50  1.00  0.20
P4   0.17  0.50  0.20  1.00
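The Jaccard similarities of the four points can be reproduced with a few lines of Python. This is a minimal sketch; the dictionary `points` and the helper `jaccard` are names chosen for the example.

```python
from itertools import combinations

points = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


# One entry per unordered pair of points.
similarity = {(p, q): jaccard(points[p], points[q])
              for p, q in combinations(points, 2)}

for (p, q), s in similarity.items():
    print(f"Sim({p},{q}) = {s:.2f}")
```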
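Since we have a similarity threshold equal to 0.3, we derive the adjacency table (an entry is 1 when the similarity is at least 0.3, and each point is a neighbor of itself):
     P1  P2  P3  P4
P1    1   1   1   0
P2    1   1   1   1
P3    1   1   1   0
P4    0   1   0   1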
By multiplying the adjacency table with itself, we derive the following table, which shows the number of links (or common neighbors):
     P1  P2  P3  P4
P1    3   3   3   1
P2    3   4   3   2
P3    3   3   3   1
P4    1   2   1   2
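The thresholding and the multiplication A x A can be sketched in plain Python without a matrix library. The variable names are chosen for the example, and the sketch assumes that a point counts as its own neighbor, which is what makes link(P1, P2) come out as 3.

```python
points = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}
theta = 0.3
names = list(points)
n = len(names)


def jaccard(a, b):
    return len(a & b) / len(a | b)


# Adjacency matrix A: 1 where the similarity reaches the threshold.
A = [[1 if jaccard(points[p], points[q]) >= theta else 0 for q in names]
     for p in names]

# Link matrix: A multiplied by itself, i.e. the number of common neighbors.
links = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]

for name, row in zip(names, links):
    print(name, row)
```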
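We compute the goodness measure for all adjacent points, assuming that $f(\theta) = \frac{1-\theta}{1+\theta}$:

$$g(P_i, P_j) = \frac{\mathrm{link}[P_i, P_j]}{(n+m)^{1+2f(\theta)} - n^{1+2f(\theta)} - m^{1+2f(\theta)}}$$

where n and m are the numbers of points in the two clusters being compared.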
We obtain the following values:
$g(P_1, P_2) = \frac{3}{(1+1)^{2.06} - 1^{2.06} - 1^{2.06}} = 1.38$
$g(P_1, P_4) = \frac{1}{(1+1)^{2.06} - 1^{2.06} - 1^{2.06}} = 0.46$
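The goodness measure is easy to check numerically. The sketch below is an illustration with assumed names (`f`, `goodness`, and the optional `exponent` override). Note that with θ = 0.3 the exponent 1 + 2f(θ) is about 2.077, while the worked values in the text use 2.06, so the second decimal can differ depending on which exponent is used.

```python
def f(theta):
    """f(theta) = (1 - theta) / (1 + theta)."""
    return (1 - theta) / (1 + theta)


def goodness(link, n, m, theta=0.3, exponent=None):
    """Goodness of merging clusters of sizes n and m that share `link` links."""
    # By default use 1 + 2 f(theta); `exponent` lets us plug in a rounded value.
    e = 1 + 2 * f(theta) if exponent is None else exponent
    return link / ((n + m) ** e - n ** e - m ** e)


# Rounded exponent, as used in the worked values.
g12_rounded = goodness(3, 1, 1, exponent=2.06)
g14_rounded = goodness(1, 1, 1, exponent=2.06)
# Exponent computed directly from theta = 0.3.
g12_exact = goodness(3, 1, 1)
g14_exact = goodness(1, 1, 1)

print(round(g12_rounded, 2), round(g14_rounded, 2))
print(round(g12_exact, 2), round(g14_exact, 2))
```

With the rounded exponent this gives about 1.38 and 0.46; with the unrounded exponent it gives about 1.35 and 0.45.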
We have an equal goodness measure for merging (P1,P2), (P2,P3) and (P3,P1), since each of these pairs has 3 links.
Now, we start the hierarchical algorithm by merging, say, P1 and P2.
A new cluster (let's call it C(P1,P2)) is formed.
It should be noted that some other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since Sim(P1,P2) = 0.4, which is not the highest. But ROCK uses the number of links as the similarity measure rather than distance.
Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors for these clusters:
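The link counts between the merged cluster and the remaining points can be obtained by summing the point-to-point link counts. The sketch below assumes the pairwise link counts that result from multiplying the adjacency table of the example with itself; the helper name `cluster_links` is chosen for illustration.

```python
# Pairwise link counts between distinct points of the example.
point_links = {
    ("P1", "P2"): 3, ("P1", "P3"): 3, ("P1", "P4"): 1,
    ("P2", "P3"): 3, ("P2", "P4"): 2, ("P3", "P4"): 1,
}


def cluster_links(a, b):
    """Number of links between clusters a and b: sum over all cross pairs."""
    return sum(point_links[tuple(sorted((p, q)))] for p in a for q in b)


c12 = {"P1", "P2"}
links_c12_p3 = cluster_links(c12, {"P3"})
links_c12_p4 = cluster_links(c12, {"P4"})
links_p3_p4 = cluster_links({"P3"}, {"P4"})
print(links_c12_p3, links_c12_p4, links_p3_p4)
```

These sums (6, 3 and 1) are the numerators used in the goodness measures of the three remaining cluster pairs.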
Then we can obtain the following goodness measures for all adjacent clusters:
It should be g(P3,P4) = 0.45
$g((P_1,P_2), P_3) = \frac{6}{(2+1)^{2.06} - 2^{2.06} - 1^{2.06}} = 1.34$ or 1.
$g((P_1,P_2), P_4) = \frac{3}{(2+1)^{2.06} - 2^{2.06} - 1^{2.06}} = 0.68$ or 0.66
Since the number of required clusters is 2, we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1,P2,P3}, leaving P4 alone in a separate cluster.
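The whole worked example can be replayed end to end with a compact Python sketch. It is a simplified illustration, not a full ROCK implementation: it skips sampling and labeling (as this assignment does), recomputes all goodness values on every pass, and breaks ties by taking the first best pair. The function name `rock` and its parameters are assumptions made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def rock(points, k, theta):
    """Link-based agglomerative clustering on a small in-memory data set."""
    names = list(points)
    # Neighbors of each point (a point is its own neighbor).
    adj = {p: {q for q in names if jaccard(points[p], points[q]) >= theta}
           for p in names}
    # link(p, q) = number of common neighbors, i.e. the (p, q) entry of A x A.
    link = {(p, q): len(adj[p] & adj[q]) for p in names for q in names}
    e = 1 + 2 * (1 - theta) / (1 + theta)

    def goodness(a, b):
        cross = sum(link[p, q] for p in a for q in b)
        n, m = len(a), len(b)
        return cross / ((n + m) ** e - n ** e - m ** e)

    clusters = [frozenset([p]) for p in names]
    while len(clusters) > max(k, 1):
        best, a, b = max(((goodness(x, y), x, y)
                          for x, y in combinations(clusters, 2)),
                         key=lambda t: t[0])
        if best <= 0:  # no remaining pair shares any link
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


points = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}
result = rock(points, k=2, theta=0.3)
print(sorted(sorted(c) for c in result))
```

On this data the sketch first merges P1 and P2, then adds P3, and leaves P4 in its own cluster.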