How to segment customers with KMeans Clustering | Unsupervised Machine Learning

Ankit Bagga
5 min read · Aug 16, 2021

Marketers have always relied on data to personalize their customers’ experience, but looking at your audience through a basic demographic lens is passé. This is where machine learning comes into the picture. With customer segmentation, the idea is to cluster “similar” customers together and personalize marketing efforts for each cluster.

In the case of an ML algorithm, the model decides (on the basis of the information fed into it) which customers are “similar” and segments them into clusters.

On a basic level, you can use RFM (Recency, Frequency, and Monetary Value) groupings to cluster your customers, but when things become complex with multiple dimensions, we can use Unsupervised Machine Learning (KMeans clustering) to cluster and segment customers for targeting.

What is k-means clustering?

In this unsupervised machine learning algorithm, we start with a given number of clusters, represented by “k”, and an initial centroid for each cluster. We then assign each data point to the cluster whose centroid is nearest.

Next, we calculate the mean of each cluster and re-assign every data point to its nearest cluster, using the newly computed means as the cluster centroids. We repeat this process until the cluster assignments stop changing. This video on k-means explains it best.

https://www.youtube.com/watch?v=4b5d3muPQmA&t=173s
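The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this article: the function name `kmeans`, the random choice of starting centroids and the toy data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data (made up): two loose groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

In practice you would normally rely on a library implementation, which also handles smarter initialisation and multiple restarts.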

Garbage-in Garbage-Out

Like any other ML algorithm, the results of the model depend on the quality and relevance of the data fed while creating the model. We marketers have to be very cognizant of the data points we are using to develop our model. These should be dimensions that explain your customers well enough to segment them.

The clusters finally assigned to your data need to make logical sense before you start personalizing your marketing.

A cluster might represent different permutations and combinations of the information that is fed into the model.

For example, if clustering is done on the basis of gender, annual income and purchasing patterns of the customers, cluster 1 might primarily represent males with an annual income of more than $10,000 and a maximum of 10 purchases in a year.

We start by loading the dataset as “customer_data”.

We then drop variables like SL No. and Customer ID, which are of no significance in building the model.
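A minimal sketch of this step with pandas is shown below. The rows are made up and the column names (`Sl_No`, `Customer Key` and the rest) are assumptions for illustration; the real file in the repository may name them differently. An in-memory CSV stands in for the actual file so the snippet runs on its own.

```python
import io
import pandas as pd

# Made-up sample rows; the real dataset and its column names may differ.
csv_text = """Sl_No,Customer Key,Avg_Credit_Limit,Total_Credit_Cards,Total_visits_bank,Total_calls_made
1,1001,20000,2,1,8
2,1002,50000,5,3,3
3,1003,150000,9,0,1
4,1004,30000,3,2,6
"""

# With a real file you would pass its path to pd.read_csv instead.
customer_data = pd.read_csv(io.StringIO(csv_text))

# Identifier columns carry no behavioural information, so leave them out of the model input.
features = customer_data.drop(columns=["Sl_No", "Customer Key"])
print(features.head())
```

Keeping `customer_data` untouched and working on a separate `features` frame makes it easy to map cluster labels back to each customer later.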

Scaling Data

Before we start the modeling process, we need to standardize all continuous variables so that they are on a similar scale. For categorical variables, we need to create dummy variables.
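One way to do both steps is with scikit-learn's `StandardScaler` and pandas' `get_dummies`, as in the sketch below. The data frame, including the `Gender` column, is hypothetical and only there to show a categorical variable being encoded.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two continuous columns and one categorical column.
df = pd.DataFrame({
    "Avg_Credit_Limit": [20000, 50000, 100000, 30000],
    "Total_Credit_Cards": [2, 5, 8, 3],
    "Gender": ["M", "F", "F", "M"],
})

continuous_cols = ["Avg_Credit_Limit", "Total_Credit_Cards"]

# Standardize continuous columns to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[continuous_cols]),
    columns=continuous_cols,
    index=df.index,
)

# One-hot encode the categorical column into dummy variables.
dummies = pd.get_dummies(df[["Gender"]], dtype=float)

model_input = pd.concat([scaled, dummies], axis=1)
print(model_input)
```

Without scaling, a variable measured in tens of thousands (such as a credit limit) would dominate the distance calculation over one measured in single digits.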

To find the ideal number of clusters for K-Means clustering, we import the model and compute the sum of squared errors (SSE), that is, the sum of squared distances from each point to its assigned centroid, for different values of “k”. We then use what is commonly known as the “Elbow Method”: we look for the value of “k” beyond which the SSE no longer decreases much as more clusters are added. It is generally advisable to keep the number of clusters as small as is practical.

Here we have selected k=3 because increasing k beyond 3 does not reduce the SSE by much. We then run the model on our scaled data to find the label for each customer. Each label is 0, 1 or 2, and the labels are stored in an array called “model.labels_”.
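The SSE-versus-k calculation might look like the sketch below, which uses scikit-learn's `KMeans` and its `inertia_` attribute (the SSE of a fitted model). The data here is synthetic, generated as three blobs purely for illustration, and the range of k values tried is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer data: three blobs of 50 points.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate k and record its SSE (inertia_).
sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 2))
```

Plotting `sse` against k gives the familiar elbow curve; the bend is where adding another cluster stops paying off.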
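Fitting the final model and reading the labels could be sketched as follows. The array `X_scaled` is a synthetic stand-in for the scaled customer data, and `n_init` and `random_state` are illustrative settings rather than values taken from this article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[-2.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
X_scaled = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Fit k-means with the k chosen from the elbow plot.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# One label per row, in the same order as the input data.
labels = model.labels_
print(np.unique(labels))
print(labels[:10])
```

The label numbers themselves are arbitrary: cluster 0 is not "better" than cluster 2 until you inspect what each cluster contains.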

We can then map these labels to the raw data to understand which customer ID falls in which cluster.

Mapping labels to the raw data

Here we can see the “Customer Key” mapped to the model labels. We can also aggregate the data to understand the 3 clusters in terms of the given dimensions of the data.

Here we can see that cluster 1 has the highest number of customers, along with the highest cluster means for avg. credit limit, credit cards, total bank visits and total calls made. Cluster 2 has the lowest values on all variables and contains only 50 customers.

It seems that our most valuable customers are in cluster 0, followed by clusters 1 and 2. We should thus prioritize our discounting and marketing activities accordingly.
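A sketch of the mapping and aggregation with pandas is given below. The nine customers and their values are made up, and the column names are assumptions; only the pattern (attach `model.labels_` as a column, then group by it) is the point.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up raw data; real column names and values may differ.
customer_data = pd.DataFrame({
    "Customer Key": [101, 102, 103, 104, 105, 106, 107, 108, 109],
    "Avg_Credit_Limit": [10000, 12000, 11000, 50000, 52000, 48000,
                         150000, 160000, 155000],
    "Total_Credit_Cards": [2, 1, 2, 5, 6, 5, 9, 10, 9],
    "Total_calls_made": [8, 9, 7, 3, 2, 3, 1, 0, 1],
})
feature_cols = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_calls_made"]

X_scaled = StandardScaler().fit_transform(customer_data[feature_cols])
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Labels come back in row order, so they can be attached directly to the raw data.
customer_data["cluster"] = model.labels_

# Per-cluster averages on the original (unscaled) values, plus cluster sizes.
summary = customer_data.groupby("cluster")[feature_cols].mean()
summary["customers"] = customer_data.groupby("cluster").size()

print(customer_data[["Customer Key", "cluster"]])
print(summary)
```

Aggregating the unscaled values keeps the summary in units a marketer can read, such as currency and counts.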

Visualising it in 3D

We can also plot these clusters three dimensions at a time (more than three dimensions are difficult to represent on a flat screen) and see how the clusters have grouped the data points.

If we look closely at the 3D graph above, we can see that the model has done a reasonable job of clustering similar customers together (at least in the three dimensions considered for the graph). If you wish to extend the idea to other dimensions, you can run the same code with the other dimensions and analyze the results.
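A 3D scatter plot of this kind can be drawn with matplotlib, roughly as sketched below. The data is synthetic, the axis labels are placeholders for whichever three variables you pick, and the output file name `clusters_3d.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for three scaled customer variables.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(40, 3)) for c in centers])
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")

# Colour each point by its cluster label.
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap="viridis", s=25)
ax.set_xlabel("variable 1 (placeholder)")
ax.set_ylabel("variable 2 (placeholder)")
ax.set_zlabel("variable 3 (placeholder)")
ax.set_title("KMeans clusters in three dimensions")

fig.savefig("clusters_3d.png", dpi=150)
```

To look at other dimensions, swap the three columns passed to `ax.scatter` while keeping the same labels.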

While there are other Unsupervised Machine Learning algorithms that you can use to segment customers, KMeans clustering is one of the most popular. If you want to access the raw data and Python files, you can visit https://github.com/AnkitBagga31/Customer_segmentation.

To learn more about the K-means clustering algorithm in practice, you can watch my YouTube video and visit ankitbagga.com for more advanced marketing resources.
