Tuesday, April 26, 2011

K-means Clustering

One of the biggest advantages of being an engineer is that we can apply concepts that we have learnt in our syllabus to virtually any problem in the world, and try to come up with new and innovative ideas to deal with these problems. We see a number of examples of engineers coming up with bizarre applications. Why refer to other engineers. Me and Anant Devi have already come up with Datar-Devi Approximations for Exams. A cousin of mine, by profession a computer geek, by passion a world class singer came up with another of such fantastic research of the relation between the frequency of your arm swing while walking and your velocity. Amazing stuff. Really, there are no boundaries for us engineers. All we need to do is think a little bit out of the box. Today, I shall be discussing one of the more important algorithms that we have in signal processing (esp speech processing)-the K-means algorithm and its application to one of the major headache for all college guys- the right girl.

Now, engineers, or for that matter most of the guys are hopeless when it comes to selecting the right girl for them. They just fall in love without thinking! (I know Ill be called a hypocrite- I too have been down this path on multiple occasions. MadB will be the one who will accuse me first, but yeah, who takes her seriously when she accuses!) And then they end up fighting with the girl over petty issues, and then act like crybabies.  What guys need is patience. This post is dedicated to all guys who are eternally confused as to what they will do with their love life and provide with insight of finding the best possible partners.

Let me introduce you to our quirky and extremely challenging to implement in matlab friend of ours, The K-means algorithm, which I will use to address your woes. To the unwary and uneducated, K-Means is a clustering algorithm that we commonly use for vector quantization. A given n dimensional ( n corresponding to various parameters) vectors when quantized into a data of say n bits is mapped to its nearest encoded bitstream which gives minimum mean squared error ( the message to which it is encoded is called the centroid) . One problem is- how do you assign the centroid. Well, it is simple. Its nothing but a mean of all the vectors in that cluster. So, dynamically, the centroid will keep on changing until all entities converge to a single point.
Mathematically, the K-means algorithm will be given by

Given a set of observations (x1x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ nS = {S1S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2
where μi is the mean of points in Si.


Pictorally, this can be represented as:

Now, coming back to the notion of the application of K-means for the problem to be addressed. Here, in k-means we have a n dimensional vector. Let these n dimensions be all the points you will look for in the girl. (My vector, is 20 dimensional.) as shown in the figure above, make three categories- Girls who you like.
Girls who you do not like. and Girls who you think are OK. Each girl will have her own unique n dimensional vector. Assign a self defined centroid for all three of them.

As you grow older through your engineering years you come across a number of girls, not necessarily in your college, who you may have considered as your potential ideal match, or as your worst possible match. This will give you quite a bit of database for statistical analysis. When you see a girl in either category, observe them for a few days, and adjust their vectors, and keep them updating. after a sufficient amount of time, the vectors will have reached a saturation point and will not alter. this is when you apply the k-means.

now, it is highly unlikely that we will get a girl who coincides with our self defined centroids. If that happens to be your case, you are one hell of a lucky ba***d. But for people like me, who have some impossible conditional vector entries, this is almost impossible. Also, it is highly improbable that you will have a girl with some qualities from girl B who has some desirable qualities and girl C who has other desirable qualities. Lets face it- its unlikely. So, the best solution is to find the "vector" which will give you the least mean squared error at the output of the K-means. The girl corresponding to that vector is the one you go for!

Just to conclude:

If you like a girl, you have to wait for some time to be absolutely sure that she is the one that is perfect for you. Observe. Make mental notes of all the small things she does. (ofcourse this advise is only for those who are looking long term.. those interested in short term can jump ships whenever they want!). Make a mental note of what you want your ideal girl to be like beforehand. Ive received a lot of flack for my ideal girl, who can be found here, but nevertheless, do it.  Jot down all the points you want in all of them, and get ready to run the K-means algorithm.It works!!

(Personally, I dont recommend this approach- I havent done it for myself, and probably will not do it. Im more of a hopeless old school- fall in love arbitrarily kind of person to try this out. I am not suffering that much, and I am happy with the things as of now. No need for me then! This analogy was made purely to showcase the versatility that an engineer can adopt to make even the simplest of the things mathematical and complex. )


  1. Good research for the most unknown variable- the right Girl...

  2. Good one re...shall keep your stats in mind :P..though I still feel the previous one about maths of an exam was better.

  3. A very good analogy for k-means algorithm