Ramgopal Prajapat:

Learnings and Views

K Means Clustering Algorithm: Explained

By: Ram on Aug 04, 2020

One of the most frequently used unsupervised algorithms is K Means. K Means Clustering is an exploratory data analysis technique.


Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). This is a non-hierarchical method of grouping objects together.

The objective is to minimize the within-cluster variance.

In this blog, we aim to explain the algorithm in simple steps and with an example.

K Means Clustering Algorithm Steps:

  • Step 1: Randomly select K observations as the initial cluster centers/centroids for the K clusters
  • Step 2: Repeat the cluster assignment and centroid update until the assignments no longer change
    • Calculate the squared Euclidean distance between each observation and each cluster centroid
    • Reassign each observation to the cluster with the smallest distance
    • Update the cluster centroids based on the revised assignments
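The steps above can be sketched in Python. This is a minimal illustration for 2-D points, not a production implementation, and the initialization simply takes the first K observations as centroids (as this post does in the worked example below):

```python
def kmeans(points, k, max_iter=100):
    """Minimal K Means on 2-D points; illustration only."""
    # Step 1: take the first k observations as the initial centroids
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid
        # (squared Euclidean distance)
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Stop when no assignment changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # guard against an empty cluster
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return assignment, centroids
```

For example, `kmeans([(1, 1), (1, 2), (10, 10), (10, 11)], 2)` separates the two pairs of nearby points into two clusters.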

Business Scenario

We have height and weight features or variables. Using these two variables, we need to group the objects into 2 clusters.

If you look at the chart above, you can see two visible clusters/segments, and we want the K Means algorithm to identify them.

Data Sample

| Obs | Height | Weight |
|-----|--------|--------|
| 1   | 185    | 72     |
| 2   | 170    | 56     |
| 3   | 168    | 60     |
| 4   | 179    | 68     |
| 5   | 177    | 62     |
| 6   | 188    | 77     |
| 7   | 180    | 71     |
| 8   | 180    | 52     |
| 9   | 183    | 84     |
| 10  | 180    | 88     |
| 11  | 180    | 67     |
| 12  | 177    | 76     |

K Means Clustering requires the value of K as an input. For this example, we consider the value of K as 2.

Iteration 1

1.0: Initialize cluster centroid

In this example, the value of K is taken as 2. The cluster centroids are initialized with the first 2 observations.

Initial Centroids

| Cluster | Height | Weight |
|---------|--------|--------|
| K1      | 185    | 72     |
| K2      | 170    | 56     |

 

1.1: Calculate Euclidean Distance between each of these 2 cluster centroids and each of the observations


Euclidean is one of the distance measures used in the K Means algorithm. The Euclidean distance between each observation and initial cluster centroids 1 and 2 is calculated.

Euclidean Distance

Distance Calculations: We will use squared Euclidean Distance for the assignment.

| Obs | Height | Weight | Squared Euclidean Distance to Centroid 1 | Squared Euclidean Distance to Centroid 2 |
|-----|--------|--------|------------------------------------------|------------------------------------------|
| 1   | 185    | 72     | (185-185)^2 + (72-72)^2 = 0              | (170-185)^2 + (56-72)^2 = 481            |
| 2   | 170    | 56     | (185-170)^2 + (72-56)^2 = 481            | (170-170)^2 + (56-56)^2 = 0              |
| 3   | 168    | 60     | (185-168)^2 + (72-60)^2 = 433            | (170-168)^2 + (56-60)^2 = 20             |
| 4   | 179    | 68     | (185-179)^2 + (72-68)^2 = 52             | (170-179)^2 + (56-68)^2 = 225            |
| 5   | 177    | 62     | (185-177)^2 + (72-62)^2 = 164            | (170-177)^2 + (56-62)^2 = 85             |
| 6   | 188    | 77     | (185-188)^2 + (72-77)^2 = 34             | (170-188)^2 + (56-77)^2 = 765            |
| 7   | 180    | 71     | (185-180)^2 + (72-71)^2 = 26             | (170-180)^2 + (56-71)^2 = 325            |
| 8   | 180    | 52     | (185-180)^2 + (72-52)^2 = 425            | (170-180)^2 + (56-52)^2 = 116            |
| 9   | 183    | 84     | (185-183)^2 + (72-84)^2 = 148            | (170-183)^2 + (56-84)^2 = 953            |
| 10  | 180    | 88     | (185-180)^2 + (72-88)^2 = 281            | (170-180)^2 + (56-88)^2 = 1124           |
| 11  | 180    | 67     | (185-180)^2 + (72-67)^2 = 50             | (170-180)^2 + (56-67)^2 = 221            |
| 12  | 177    | 76     | (185-177)^2 + (72-76)^2 = 80             | (170-177)^2 + (56-76)^2 = 449            |
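These calculations can be reproduced in Python with the sample data above (a small sketch for checking the table, not part of the post's R workflow):

```python
data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]
centroids = [(185, 72), (170, 56)]  # initial centroids: observations 1 and 2

def sq_dist(p, c):
    """Squared Euclidean distance between an observation and a centroid."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

for i, p in enumerate(data, start=1):
    d1, d2 = sq_dist(p, centroids[0]), sq_dist(p, centroids[1])
    print(i, d1, d2, "K1" if d1 <= d2 else "K2")
```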

1.2: Cluster Assignment

1.3: Update Cluster Centroid

Has the cluster assignment changed?

Yes, so continue for the next iteration.
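Working through the distances to the initial centroids (185, 72) and (170, 56), observations 2, 3, 5, and 8 are closest to centroid 2 and the rest to centroid 1; the centroid update in step 1.3 is then just a per-cluster mean. A small Python sketch:

```python
data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]
# Iteration-1 assignment: each observation goes to the centroid with the
# smaller squared Euclidean distance to (185, 72) vs (170, 56)
assignment = [1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1]

def update_centroid(cluster):
    """New centroid = mean of the observations assigned to the cluster."""
    members = [p for p, a in zip(data, assignment) if a == cluster]
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))

print(update_centroid(1))  # → (181.5, 75.375)
print(update_centroid(2))  # → (173.75, 57.5)
```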

Iteration 2

2.1: Distance Calculation

2.2 Cluster Assignment

2.3 Cluster Centroids

Is there a change in the assignment?

No, so we stop iterating.

 

Final Clusters

 

A few important considerations in K Means

  • The scale of measurement influences Euclidean distance, so variable standardization becomes necessary
  • Depending on expectations, you may require outlier treatment
  • K Means clustering may be biased by the initial centroids, called cluster seeds
  • The number of clusters (K) is an input parameter and also impacts the clusters that get created

 

How do we decide the value of K in K-Means Clustering?

When running a K Means clustering project, multiple values of K are typically considered. There are a few considerations when selecting the final clustering:

  • Observation % in each of the clusters
  • R^2 value of each variable
  • Overall R^2 value and other clustering performance statistics, e.g. the Cubic Clustering Criterion (CCC)
  • The elbow method, which helps identify the number of clusters based on the sum of squared errors (SSE)
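The elbow method can be sketched in Python by computing the SSE for several values of K with a tiny K Means implementation (scikit-learn's `KMeans` exposes the same quantity as `inertia_`; this standalone version is an illustration only):

```python
def kmeans(points, k, max_iter=100):
    """Tiny K Means: first-k initialization, illustration only."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(max_iter):
        new = [min(range(k),
                   key=lambda c: sum((x - y) ** 2
                                     for x, y in zip(p, centroids[c])))
               for p in points]
        if new == assignment:  # converged: no assignment changed
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignment, centroids

def sse(points, assignment, centroids):
    """Sum of squared distances of each point to its cluster centroid."""
    return sum(sum((x - y) ** 2 for x, y in zip(p, centroids[a]))
               for p, a in zip(points, assignment))

data = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
        (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76)]
for k in range(1, 5):
    a, c = kmeans(data, k)
    print(k, round(sse(data, a, c), 1))
```

Plotting SSE against K, you look for the "elbow" where the curve stops dropping sharply and flattens out.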

 

How do we standardize variables in K Means Clustering?

In K Means Clustering, typically continuous variables are considered. Across continuous variables, the measurement scale can differ significantly. For example, Age values could vary from 0 to 100, but the Salary variable takes values from 0 to hundreds of thousands.

In K Means clustering, one of the similarity measures used is Euclidean distance, which is calculated as follows.

Euclidean Distance Age and Salary

Euclidean Distance Measure can be biased due to the scale of measurement. In this example, we wanted to explore how the scale of measurement can impact K Means clustering and assignment of objects to different clusters.

K Means Data

The maximum contribution of Age could be (33-20)^2 = 169, but even the smallest difference in Salary between objects will be in the thousands. So the scale of measurement has an impact on the Euclidean distance calculation, i.e. the similarity measure.

Now, one of the ways to standardize variables is to subtract the mean and then divide by the standard deviation.

Standardization

Now we want to see the impact of scale on K Means clustering, so we run K Means both with and without scaled variables.

Other tricks used for variable standardization are

  • Percentile Ranks
  • (X - min X)/(max X - min X)
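Both standardization approaches can be sketched in Python (the Age and Salary values here are hypothetical, for illustration only):

```python
def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale values to the [0, 1] range: (X - min X) / (max X - min X)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

age = [20, 25, 28, 33, 22]                    # hypothetical ages
salary = [25000, 80000, 62000, 95000, 30000]  # hypothetical salaries

# After scaling, Age and Salary contribute comparably to Euclidean distance
print(min_max(age))
print(min_max(salary))
```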

Scenario 1: K Means Clustering in R without Variable Standardization

k means without variable standardization

Scenario 2: K Means Clustering using R with Variable Standardization

k means with variable standardization

Analysis and Conclusion

  • Although the visualization (which inherently accounts for scale) makes it evident that observations 2, 3, and 5 should be grouped together, K Means clustering without variable standardization puts observations 3 and 4 into one cluster.
  • Though the Age values for observations 3 and 4 are significantly different, more importance was given to Salary, which has a larger scale of measurement.
  • In Scenario 2, we standardized the variables Age and Salary, so both have the same scale of measurement.
  • In Scenario 2, observations 1 and 4 fall into one cluster and observations 2, 3, and 5 into the second. This is in line with both expectations and the visualization.
  • We can conclude that the scale of measurement plays an important role in K Means clustering, and variable standardization should be considered to suppress its effect.

Concluding thoughts

In this blog, we have covered the key concepts of the K Means algorithm: how it groups objects together and why finding the right value of K is important. We discussed both practical considerations and the elbow method for choosing K. Finally, we discussed the role of variable standardization with an example.

Do you have any questions or comments? Do share and we will learn and improve together.
