Thursday, October 29, 2009

K-Means: algo to cluster

K-means is most important flat clustering algo. There has lots of research on K-means in recenpast and still leaves a lot of room for research. Before understanding what a K-means is , one should know why and when it is done.
Why: When we have a problem statement that specifies us to create a set of documents {d1,d2....dN}that has to be clustered and a desired number of cluster =K along with an objective function to minimize the average squared Euclidean distance of documents from their cluster center . So what we have is N set of documents ... can be N webpages , with already specified K number of clusters to be made out of them ,we also have an objective that would lead us to clusters having documents with least dissimilarity (or distance). Now to establish the center of our cluster from where we take take distance. It can simply be defined by equation:
where we have vector μ as the centroid and have documents as length normalized vectors. For vector normalizing visit:http://www.fundza.com/vectors/normalize/index.html,
Now ever creating a cluster by K-means one should remember that we should test the measure of how well centroid is for the particular cluster . This can easily be found by RSS (Residual Sum of Squares), the squared distance of each vector from its centroid.


Now the real algorithm to cluster the documents can be presented


The arguments supplied to the function are K: the number of clusters to be made and the set of documents . The first step involves randomly selected seeds(an area of debate and discussion),a set of K documents from N. Then we actually move the cluster's center in order to minimize RSS(Residual Sum of Squares) . There is a while loop which runs until stopping criterion has been met(discussed later). Merging of clusters is done according to a defined (arg min) step(8-9) , whose computation is based on distance between the document and the centroid. Each time the merge is done re-computation of centroid is done. step(10-11).
Now we need to consider stopping criteria :Well as normal fix a number of iteration , or keep doing until you find centroids are not changing.. A hybrid can help as is former we may unnecessarily be doing computations even if we have already found out the value and on the later we could have been calculating the value for an infinite time or to a level that we don't afford , A sensible approach could be to fix a threshold RSS which gets us to fix a standard .

No comments:

Post a Comment