## K Means Clustering Algorithm: Explained

By: Ram on Aug 04, 2020

One of the most frequently used unsupervised algorithms is K Means. K Means Clustering is an exploratory data analysis technique.


Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). This is a non-hierarchical method of grouping objects together.

The objective is to minimize the within-cluster variance.
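Stated formally (a standard formulation of this objective, not specific to this post): with clusters $S_1, \ldots, S_K$ and cluster means $\mu_k$, K Means seeks

```latex
\min_{S_1, \ldots, S_K} \; \sum_{k=1}^{K} \sum_{x \in S_k} \lVert x - \mu_k \rVert^2
```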

In this blog, we aim to explain the algorithm in simple steps and with an example.

## K Means Clustering Algorithm Steps:

• Step 1: Randomly assign k observations as the cluster centers/centroids for the K clusters
• Step 2: Repeat the following until the cluster assignments no longer change:
  • Calculate the squared Euclidean distance between each observation and each cluster centroid
  • Reassign each observation to the cluster with the least distance
  • Update the cluster centroids based on the revised assignments
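The steps above can be sketched in code. The original post does not include an implementation; this is a minimal illustrative Python version that seeds the centroids with the first k observations, as the worked example below does:

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100):
    # Step 1: use the first k observations as the initial centroids
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign each observation to the nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Step 2 stopping rule: no change in assignments
        assignment = new_assignment
        # Update each centroid to the mean of its assigned observations
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids
```

Running this on the height/weight sample below with k = 2 converges in two iterations.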

We have height and weight features or variables. Using these two variables, we need to group the objects into 2 clusters.

If you plot these observations, you will see that there are two visible clusters/segments, and we want these to be identified using the K Means algorithm.

Data Sample

| Obs | Height | Weight |
|-----|--------|--------|
| 1   | 185    | 72     |
| 2   | 170    | 56     |
| 3   | 168    | 60     |
| 4   | 179    | 68     |
| 5   | 177    | 62     |
| 6   | 188    | 77     |
| 7   | 180    | 71     |
| 8   | 180    | 52     |
| 9   | 183    | 84     |
| 10  | 180    | 88     |
| 11  | 180    | 67     |
| 12  | 177    | 76     |

K Means Clustering requires the value of K as an input. For this example, we are considering the value of K as 2.

## Iteration 1

### 1.0: Initialize cluster centroid

In this example, the value of K is considered as 2.  Cluster centroids are initialized with the first 2 observations.

| Cluster | Initial Centroid Height | Initial Centroid Weight |
|---------|-------------------------|-------------------------|
| K1      | 185                     | 72                      |
| K2      | 170                     | 56                      |

### 1.1: Calculate Euclidean Distance between each of these 2 cluster centroids and each of the observations

Euclidean is one of the distance measures used in the K Means algorithm. The Euclidean distance between each observation and initial cluster centroids 1 and 2 is calculated.

#### Distance Calculations: We will use squared Euclidean Distance for the assignment.

| Obs | Height | Weight | Squared Euclidean Distance - Centroid 1 | Squared Euclidean Distance - Centroid 2 |
|-----|--------|--------|------------------------------------------|------------------------------------------|
| 1   | 185    | 72     | (185-185)^2 + (72-72)^2                  | (170-185)^2 + (56-72)^2                  |
| 2   | 170    | 56     | (185-170)^2 + (72-56)^2                  | (170-170)^2 + (56-56)^2                  |
| 3   | 168    | 60     | (185-168)^2 + (72-60)^2                  | (170-168)^2 + (56-60)^2                  |
| 4   | 179    | 68     | (185-179)^2 + (72-68)^2                  | (170-179)^2 + (56-68)^2                  |
| 5   | 177    | 62     | (185-177)^2 + (72-62)^2                  | (170-177)^2 + (56-62)^2                  |
| 6   | 188    | 77     | (185-188)^2 + (72-77)^2                  | (170-188)^2 + (56-77)^2                  |
| 7   | 180    | 71     | (185-180)^2 + (72-71)^2                  | (170-180)^2 + (56-71)^2                  |
| 8   | 180    | 52     | (185-180)^2 + (72-52)^2                  | (170-180)^2 + (56-52)^2                  |
| 9   | 183    | 84     | (185-183)^2 + (72-84)^2                  | (170-183)^2 + (56-84)^2                  |
| 10  | 180    | 88     | (185-180)^2 + (72-88)^2                  | (170-180)^2 + (56-88)^2                  |
| 11  | 180    | 67     | (185-180)^2 + (72-67)^2                  | (170-180)^2 + (56-67)^2                  |
| 12  | 177    | 76     | (185-177)^2 + (72-76)^2                  | (170-177)^2 + (56-76)^2                  |
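Evaluating these expressions with a quick Python check (not part of the original post) confirms which centroid is nearest for each observation:

```python
# Height/weight observations from the data sample above
observations = [
    (185, 72), (170, 56), (168, 60), (179, 68), (177, 62), (188, 77),
    (180, 71), (180, 52), (183, 84), (180, 88), (180, 67), (177, 76),
]
centroids = [(185, 72), (170, 56)]  # initial centroids K1 and K2

for i, (h, w) in enumerate(observations, start=1):
    d1 = (centroids[0][0] - h) ** 2 + (centroids[0][1] - w) ** 2
    d2 = (centroids[1][0] - h) ** 2 + (centroids[1][1] - w) ** 2
    nearest = "K1" if d1 <= d2 else "K2"
    print(f"Obs {i}: d(K1)={d1}, d(K2)={d2} -> {nearest}")
```

Observations 2, 3, 5 and 8 end up nearest to centroid K2; the rest are assigned to K1.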

### 1.3: Update Cluster Centroid

Has the cluster assignment changed?

Yes, so continue for the next iteration.
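Based on the iteration-1 assignment (observations 2, 3, 5 and 8 to cluster K2, the rest to K1), each updated centroid is simply the per-cluster mean. A small Python check of this arithmetic:

```python
# Iteration-1 cluster memberships from the distance table above
cluster_k1 = [(185, 72), (179, 68), (188, 77), (180, 71), (183, 84),
              (180, 88), (180, 67), (177, 76)]   # observations 1, 4, 6, 7, 9, 10, 11, 12
cluster_k2 = [(170, 56), (168, 60), (177, 62), (180, 52)]  # observations 2, 3, 5, 8

def mean_point(points):
    # Component-wise mean of a list of (height, weight) points
    return tuple(sum(dim) / len(points) for dim in zip(*points))

print(mean_point(cluster_k1))  # (181.5, 75.375)
print(mean_point(cluster_k2))  # (173.75, 57.5)
```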

## Iteration 2

### 2.3: Cluster Centroids

Is there a change in the assignment?

No, so stop; the clusters have converged.

### Final Clusters

The final clusters are: observations 1, 4, 6, 7, 9, 10, 11 and 12 in cluster K1, and observations 2, 3, 5 and 8 in cluster K2.

A few important considerations in K Means

• The scale of measurement influences Euclidean distance, so variable standardization becomes necessary
• Depending on expectations, you may require outlier treatment
• K Means clustering can be sensitive to the initial centroids, called cluster seeds
• The number of clusters, K, is an input parameter and also impacts the clusters that get created

How do we decide the value of K in K-Means Clustering?

When a K Means clustering project is being done, multiple values of K are considered. There are a few considerations for selecting the final clustering:

• Observation % in each of the clusters
• R^2 value of each of the variables
• Overall R^2 value and other clustering performance statistics, e.g. CCC (Cubic Clustering Criterion)
• The elbow method can help identify the number of clusters based on the sum of squared errors (SSE)
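A sketch of the elbow method on the height/weight sample above: run K Means for several values of K, record the SSE (the sum of squared distances of each point to its cluster centroid), and look for the "elbow" where the decrease flattens. Minimal Python, with a small K Means repeated inline so the snippet runs standalone (centroids seeded with the first K observations):

```python
def k_means(points, k, max_iter=100):
    # Minimal K Means: seed with the first k observations, iterate to convergence
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return assignment, centroids

def sse(points, k):
    # Sum of squared distances from each point to its assigned centroid
    assignment, centroids = k_means(points, k)
    return sum(
        sum((x - y) ** 2 for x, y in zip(p, centroids[a]))
        for p, a in zip(points, assignment)
    )

heights_weights = [(185, 72), (170, 56), (168, 60), (179, 68), (177, 62),
                   (188, 77), (180, 71), (180, 52), (183, 84), (180, 88),
                   (180, 67), (177, 76)]
for k in (1, 2, 3, 4):
    print(k, sse(heights_weights, k))  # SSE shrinks as K grows; look for the elbow
```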

### How do we standardize variables in K Means Clustering?

In K Means Clustering, typically continuous variables are considered. Within continuous variables, the variable measurement scale can be significantly different.  For example, Age values could vary from 0 to 100, but the salary variable takes values from 0 to hundreds of thousands.

In K Means clustering, one of the similarity measures used is Euclidean distance. For two points p = (p1, p2) and q = (q1, q2), it is calculated as d(p, q) = √((p1 - q1)^2 + (p2 - q2)^2).

Euclidean Distance Measure can be biased due to the scale of measurement. In this example, we wanted to explore how the scale of measurement can impact K Means clustering and assignment of objects to different clusters.

The maximum contribution of Age could be (33 - 20)^2 = 169, but even the smallest difference between Salary values for the objects will be in the thousands, and its square in the millions. So, the scale of measurement has an impact on the Euclidean distance calculation or similarity measure.

Now, one of the ways to standardize variables is to subtract the mean and then divide by the standard deviation: z = (x - mean) / standard deviation.

Now, we want to see the impact of scale on K Means clustering, so we run K Means both with and without scaled variables.

Other tricks used for variable standardization are:

• Percentile Ranks
• (X - min X)/(max X - min X)
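For instance, min-max scaling maps each variable onto the [0, 1] range. A quick illustration (the age values here are made up):

```python
def min_max_scale(values):
    # (x - min) / (max - min): rescales values to the [0, 1] range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 32, 22, 33]  # hypothetical ages
print(min_max_scale(ages))   # minimum maps to 0.0, maximum to 1.0
```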

Scenario 1: K Means Clustering in R without Variable Standardization

Scenario 2: K Means Clustering using R with Variable Standardization
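The original post's Age/Salary data table and R code are not reproduced here, so the sketch below uses made-up Age and Salary values for five observations, chosen to mirror the described outcome, and plain Python instead of R:

```python
def k_means(points, k, max_iter=100):
    # Minimal K Means, seeded with the first k points; returns cluster labels
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return assignment

def z_scores(values):
    # Standardize: subtract the mean, divide by the (population) standard deviation
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# Hypothetical observations 1-5: (Age, Salary)
ages = [20, 30, 32, 22, 33]
salaries = [100000, 40000, 85000, 95000, 45000]

raw = list(zip(ages, salaries))
scaled = list(zip(z_scores(ages), z_scores(salaries)))

print("without scaling:", k_means(raw, 2))     # Salary dominates the distances
print("with scaling:   ", k_means(scaled, 2))  # Age and Salary contribute equally
```

Without scaling, observation 3 is pulled into the high-salary cluster alongside observations 1 and 4; with z-scored variables, observations 1 and 4 form one cluster and observations 2, 3 and 5 the other.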

Analysis and Conclusion

• Though the graphical visualization (which implicitly accounts for scaling) made it evident that observations 2, 3 and 5 should be grouped together, K Means clustering without variable standardization puts observations 3 and 4 into one cluster.
• Though the Age values for observations 3 and 4 are significantly different, importance was given to Salary, which has a much larger scale of measurement.
• In scenario 2, we standardized the variables Age and Salary, so both variables have the same scale of measurement.
• In scenario 2, observations 1 and 4 fall into one cluster and observations 2, 3 and 5 into the second cluster. This is in line with expectations and with the visualization.
• It can be concluded that the scale of measurement plays an important role in K Means clustering, and variable standardization should be considered to suppress the effect of measurement scale.

## Concluding thoughts

In this blog, we have covered the key concepts of the K Means algorithm: how K Means clustering groups objects together, and why finding the right value of K is important. We discussed both practical considerations and the elbow method for finding the value of K in K Means clustering. Finally, we discussed the role of variable standardization with an example.

Do you have any questions or comments? Do share and we will learn and improve together.
