By: Ram on Aug 04, 2020
One of the most frequently used unsupervised algorithms is K Means. K Means Clustering is an exploratory data analysis technique.
List of real-life Clustering Applications
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). K Means is a non-hierarchical method of grouping objects together.
The objective is to minimize the within-cluster variance.
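One common way to write this objective is the within-cluster sum of squares. The symbols are assumptions introduced here for illustration, not taken from the original post: K clusters C_1, ..., C_K, observations x_i, and centroids mu_k.

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Each centroid is the mean of the observations in its cluster, so a smaller sum means tighter clusters.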
In this blog, we aim to explain the algorithm in simple steps and with an example.
We have height and weight features or variables. Using these two variables, we need to group the objects into 2 clusters.
If you look at the above chart, you can see two visible clusters/segments, and we want the K Means algorithm to identify them.
Data Sample
Obs  Height  Weight
1    185     72
2    170     56
3    168     60
4    179     68
5    177     62
6    188     77
7    180     71
8    180     52
9    183     84
10   180     88
11   180     67
12   177     76
K Means Clustering requires the value of K as an input. For this example, we take K = 2. The cluster centroids are initialized with the first 2 observations.
Initial Centroids

Cluster  Height  Weight
K1       185     72
K2       170     56
Euclidean distance is one of the distance measures used in the K Means algorithm. The squared Euclidean distance between each observation and the initial cluster centroids 1 and 2 is calculated, and each observation is assigned to the nearer centroid.
Obs  Height  Weight  Squared Euclidean Distance Centroid 1  Squared Euclidean Distance Centroid 2
1    185     72      (185-185)**2 + (72-72)**2              (170-185)**2 + (56-72)**2
2    170     56      (185-170)**2 + (72-56)**2              (170-170)**2 + (56-56)**2
3    168     60      (185-168)**2 + (72-60)**2              (170-168)**2 + (56-60)**2
4    179     68      (185-179)**2 + (72-68)**2              (170-179)**2 + (56-68)**2
5    177     62      (185-177)**2 + (72-62)**2              (170-177)**2 + (56-62)**2
6    188     77      (185-188)**2 + (72-77)**2              (170-188)**2 + (56-77)**2
7    180     71      (185-180)**2 + (72-71)**2              (170-180)**2 + (56-71)**2
8    180     52      (185-180)**2 + (72-52)**2              (170-180)**2 + (56-52)**2
9    183     84      (185-183)**2 + (72-84)**2              (170-183)**2 + (56-84)**2
10   180     88      (185-180)**2 + (72-88)**2              (170-180)**2 + (56-88)**2
11   180     67      (185-180)**2 + (72-67)**2              (170-180)**2 + (56-67)**2
12   177     76      (185-177)**2 + (72-76)**2              (170-177)**2 + (56-76)**2
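The distance calculations in the table can be reproduced with a few lines of Python. This is a minimal sketch: the data and initial centroids come from the tables above, while the helper name `squared_distance` and the `rows` structure are illustrative choices, not part of the original post.

```python
# (height, weight) observations from the data sample, in order Obs 1..12.
data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]

# Initial centroids K1 and K2: the first two observations.
centroids = [data[0], data[1]]


def squared_distance(p, q):
    """Squared Euclidean distance between two (height, weight) points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# One row per observation: both distances and the nearer cluster (1 or 2).
rows = []
for obs, point in enumerate(data, start=1):
    dists = [squared_distance(point, c) for c in centroids]
    cluster = dists.index(min(dists)) + 1
    rows.append((obs, dists[0], dists[1], cluster))

for obs, d1, d2, cluster in rows:
    print(f"Obs {obs:2d}: d1={d1:4d}  d2={d2:4d}  -> K{cluster}")
```

With these initial centroids, observations 2, 3, 5 and 8 are nearer to K2 and the remaining eight observations are nearer to K1.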
Has the cluster assignment changed?
Yes, so continue for the next iteration.
Is there a change in the assignment?
No, so stop further iteration.
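The assign, update and check loop can be sketched as below. This is a standard batch version of K Means, which recomputes all centroids after every full pass, so it may differ in detail from the step-by-step walkthrough in this post. The function names and the `max_iter` parameter are assumptions made for illustration.

```python
data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Replace each centroid with the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centroids.append(old)  # keep the old centroid for an empty cluster
    return new_centroids


def k_means(points, k, max_iter=100):
    centroids = [tuple(p) for p in points[:k]]  # first k observations
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no change in assignment: stop
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids, iteration


labels, centroids, iterations = k_means(data, 2)
print(labels, centroids, iterations)
```

On this data the sketch stops on the second pass, because the assignment no longer changes. The final centroids are (181.5, 75.375) and (173.75, 57.5).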
A few important considerations in K Means
How do we decide the value of K in K Means Clustering?
When a K Means clustering project is carried out, multiple values of K are considered. A few considerations help in selecting the final clustering:
Observation % in each of the clusters
R^2 value of each variable
Overall R^2 value and other clustering performance statistics e.g. CCC
Elbow method can help in identifying the number of clusters based on the sum of squared distance (SSE) value
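As a rough illustration of the elbow method, the sketch below runs a small batch K Means for K = 1 to 4 on the height/weight data and records the SSE for each K. Starting the centroids at the first K observations and the range of K values are assumptions chosen for this example. In practice you would plot SSE against K and look for the point where the decrease flattens.

```python
data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Batch K Means; centroids start at the first k points (an assumption)."""
    centroids = [tuple(map(float, p)) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))


sse_by_k = {}
for k in range(1, 5):
    labels, centroids = k_means(data, k)
    sse_by_k[k] = sse(data, labels, centroids)

for k, value in sse_by_k.items():
    print(f"K={k}: SSE={value:.2f}")
```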
In K Means Clustering, typically continuous variables are considered. Within continuous variables, the variable measurement scale can be significantly different. For example, Age values could vary from 0 to 100, but the salary variable takes values from 0 to hundreds of thousands.
In K Means clustering, one of the similarity measures used is Euclidean Distance. And Euclidean distance is calculated as follows.
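For two objects x and y measured on p variables, the standard Euclidean distance is written as follows. The symbols x, y and p are notation introduced here.

```latex
d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
```

With two variables such as height and weight, this reduces to the square root of the squared height difference plus the squared weight difference. Comparing squared distances gives the same nearest centroid, which is why the square root is often skipped.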
Euclidean Distance Measure can be biased due to the scale of measurement. In this example, we wanted to explore how the scale of measurement can impact K Means clustering and assignment of objects to different clusters.
The maximum contribution of age could be (33 - 20)^2 = 169, but even the smallest difference in Salary between the objects will be in the thousands. So, the scale of measurement will have an impact on the Euclidean Distance calculation or similarity measure.
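To make the imbalance concrete, here is a small sketch. The ages 33 and 20 come from the text, but the two salary values are hypothetical numbers made up for illustration, since the example data is not shown here.

```python
# Ages are from the text; salaries are hypothetical illustration values.
person_a = {"age": 33, "salary": 52000}
person_b = {"age": 20, "salary": 50000}

# Contribution of each variable to the squared Euclidean distance.
age_term = (person_a["age"] - person_b["age"]) ** 2
salary_term = (person_a["salary"] - person_b["salary"]) ** 2

# Fraction of the squared distance that comes from salary alone.
salary_share = salary_term / (age_term + salary_term)

print(f"age term: {age_term}, salary term: {salary_term}")
print(f"share of squared distance due to salary: {salary_share:.6f}")
```

Even with a salary gap of only 2,000, the salary term dwarfs the age term, so the unscaled distance is driven almost entirely by salary.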
Now, one of the ways to standardize variables is to subtract the mean and then divide by the standard deviation.
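A minimal sketch of this standardization (often called a z-score) using only the Python standard library is shown below. It is applied to the heights from the earlier data sample purely as an illustration. The function name `standardize` is an assumption, and the sample standard deviation (n - 1 denominator) is used here, which is also what R's `scale()` does by default.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]


# Heights from the data sample used earlier in the post.
heights = [185, 170, 168, 179, 177, 188, 180, 180, 183, 180, 180, 177]
z_heights = standardize(heights)

print([round(z, 2) for z in z_heights])
```

After standardization, each variable has mean 0 and standard deviation 1, so no single variable dominates the distance just because of its units.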
Now, we want to really see the impact of the scale in K Means clustering. So, we would want to run K Means with and without scaled variables.
Other tricks used for variable standardization are
Scenario 1: K Means Clustering in R without Variable Standardization
Scenario 2: K Means Clustering using R with Variable Standardization
Analysis and Conclusion
In this blog, we have covered the key concepts of the K Means algorithm and how it groups objects together. Finding the right value of K is also important, and we have discussed both practical considerations and the elbow method for choosing it. Finally, we have discussed the role of variable standardization with an example.
Do you have any questions or comments? Do share and we will learn and improve together.