K Means Clustering for Customer Segmentation
On my way to pursuing data analysis as a profession, I am working on projects that can strengthen my portfolio. Thinking about analysis, I believe the most valuable contribution of data is to benefit the business, and the success of a business is largely proportional to the value it gives to its customers. The happier a customer is, the more loyal they will be to the brand, and that ultimately contributes to the business. In my mind, customer value translates into knowing your customers so you can treat them in the most tailored way possible, and that can be done through “Customer Segmentation”.
What is customer segmentation?
Have you ever had an experience making/crafting a music playlist for yourself or for someone else?
What approach would you follow? You would first define the mood, activity, and vibe you expect to have while listening, right? What you are really doing is segmenting your playlist based on mood and activity. The same can be said for customers, because every single person who buys from you has their own reason and approach to buying, which we can study as a pattern of buying behavior.
Here my goal is to segment customers of a mall with the help of their demographic and psychographic data.
The source of this data is Kaggle and here is the link to it.
The methodology I have used for customer segmentation is clustering, more specifically KMeans clustering, which I discuss below.
What is clustering?
Clustering is the division of data into groups based on similarity, such that objects within a single group (cluster) are more similar to each other than to objects in other groups. In simpler terms, clustering is the activity of grouping objects that show similar behavior.
Clustering is an unsupervised machine learning technique: the data isn’t labeled with the desired result; instead, the algorithm identifies patterns within the data and assembles the objects into homogeneous groups. We have used KMeans clustering, a popular clustering method that divides the population into a “K” number of clusters.
Since we choose the number of clusters ourselves, we can set it equal to the number of classes we expect, or more. Interestingly, the resulting cluster labels can even be used as features in a supervised machine learning algorithm.
Data Analysis Processes
So now we will finally walk through the data processing, manipulation, modeling, and plotting:
Data Exploration
Analysis (Univariate, bivariate, and multivariate)
Modeling (Clustering)
KMeans Algorithm
Data exploration and Wrangling
Data exploration means getting to know the data by looking at it and analyzing it, taking it from its raw form to a cleaned and precise form.
First, we will import libraries that we will use for analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
Data loading:
df = pd.read_csv(r"D:\Data sets\Mall_Customers.csv")   # raw string so the backslashes in the Windows path aren't treated as escapes
df.head(10)
So, we can see that we have 5 variables: “CustomerID”, “Gender”, “Age”, “Annual Income (k$)” and “Spending Score (1-100)”. Of these, “CustomerID” is an indexing variable and doesn’t tell us much about the data, so you can drop this column if you want; I will keep it since it isn’t a hindrance to my analysis. As for naming, I prefer the last two columns spelled out in words instead of symbols. We could also transform “Gender” into a binary value since it has 2 categories, but I will keep it as it is for now to make the analysis easier to describe.
df.rename(columns = {'Annual Income (k$)':'Annual_Income_inThousandUSD','Spending Score (1-100)':'Spending_Score' },
inplace = True)
df.head()
Now that the data frame looks okay, let's know more about the data itself:
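The summary I describe below comes from the usual pandas overviews; a minimal way to reproduce it is:
df.info()        # 200 entries, column types and non-null counts
df.describe()    # count, mean, std, min, quartiles and max for the numeric columns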
Here we learn that there are 200 entries. We get the mean, which gives the average of each column, and the minimum and maximum values, but the most useful information is the standard deviation: “Age” varies by about 13.96 (roughly 14) around its mean, and in the same way “Spending Score” varies by about 25.82 around its average of 50.2.
Another interesting feature is the percentiles: roughly 25% of observations have an age below 29, and similarly 75% of observations have an “Annual Income” below 78 thousand US dollars. This information is quite revealing, and we will investigate it further with univariate and bivariate analysis.
Data Analysis
We have analyzed data numerically but now let’s dive into the graphical interpretation of data.
Univariate Analysis
We will approach the graphical interpretation first with univariate analysis, which means plotting each quantitative variable against its frequency (density).
This illustrates that most of the observations have an “Annual Income” between roughly 50 and 75 thousand US dollars; to make it clearer, we will plot a histogram.
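A minimal sketch of that histogram, assuming the renamed income column (the bin count here is my own choice):
plt.hist(df['Annual_Income_inThousandUSD'], bins=15)   # bin count is an assumption
plt.xlabel("Annual Income (k$)")
plt.ylabel("Count")
plt.show()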
Here we get a better picture: the most frequently observed “Annual Income” values fall in the 60s range, followed by the 80s range, and so on.
Looking at the data, I believe there are three columns that ought to be plotted, so instead of writing code for each of them individually I did it in one go with a for loop. Here’s how:
columns = ['Age', 'Annual_Income_inThousandUSD', 'Spending_Score']
for i in columns:
    plt.figure()
    sns.distplot(df[i])              # distribution (density) plot for each column
    plt.savefig(f'viz3_{i}.png')     # save each figure under its own name instead of overwriting one file
I have tried a similar approach for plotting the “KDE” plot where we will investigate these three variables with respect to “Gender”.
for i in columns:
    plt.figure()
    sns.kdeplot(df[i], shade=True, hue=df['Gender'])   # density of each column, split by Gender
This plot shows that between 20 and 60 years of age, most of the observations are female, which is worth keeping in mind when targeting audiences.
Bivariate Analysis
Now we will be plotting attributes against each other to show the relevance and dependence of one variable on the other.
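The scatter plot discussed below can be reproduced with a minimal sketch like this (assuming the renamed columns):
plt.scatter(df['Annual_Income_inThousandUSD'], df['Spending_Score'])
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.show()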
Here I can already distinguish a few clusters based on Annual Income and Spending Score: there is a fraction of observations with low income but a high spending score, and another fraction with both a high income and a high spending score.
Let’s look for more data visually in pairs with respect to gender.
x=df.copy()
y=x.drop('CustomerID',axis=1)
sns.pairplot(y,hue='Gender');
Here we copied the data frame to x and then dropped the first column, “CustomerID”, because it won’t be of much importance.
Here I noticed something interesting: the distributions look fairly even with respect to both genders. Let’s have a look at it numerically:
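A simple way to get that numeric view is a group-by on Gender (a sketch, assuming the renamed columns):
df.groupby('Gender')[['Age', 'Annual_Income_inThousandUSD', 'Spending_Score']].mean()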
So, this confirms our reading: there isn’t much difference by gender.
To get a clear view of the relation, let's plot the correlation between these variables.
sns.heatmap(y.select_dtypes('number').corr(), annot=True);   # correlate only the numeric columns, so the text "Gender" column doesn't get in the way
I have used annotation so that we also get a numeric representation of the correlations. One interesting relation stands out: as age increases, the spending score tends to decrease, which we can also keep in mind when targeting audiences.
Clustering
At this point I feel I have covered every aspect of the analysis I can think of from this data, except for the part we are actually aiming for: dissecting our observations into homogeneous groups, i.e. the clusters.
KMeans Algorithm
Let’s get into KMeans clustering straight away:
Univariate Clustering
clustering1 = KMeans()                       # no n_clusters given, so the default of 8 is used
a = df.copy()
a = a.drop(df.columns[1], axis=1)            # drop the non-numeric "Gender" column
a.head()
clustering1.fit(a[['Annual_Income_inThousandUSD']])   # cluster on Annual Income alone
I have dropped the “Gender” column as it was non-numeric and wasn’t needed for clustering purposes.
One thing to note here is that I haven’t assigned the number of clusters, so it takes the default, which is 8; later we will find the right number of clusters using “Inertia” and the “Elbow” technique.
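The plots below refer to a "clusters" frame holding the predicted labels; the exact code for creating it isn't shown here, but a minimal sketch could look like this:
clusters = a.copy()
clusters['cluster_pred'] = clustering1.labels_   # label assigned to each customer
clusters['cluster_pred'].value_counts()          # how many customers fall into each cluster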
So, from the cluster counts we can see that clusters 7 and 8 carry far more observations than the rest; let’s have a look at the graphical representation of how these clusters are distributed.
plt.figure(figsize=(5,5))
plt.scatter(clusters['Annual_Income_inThousandUSD'],clusters['Spending_Score'],c=clusters['cluster_pred'],cmap='rainbow')
plt.title("Clustering customers based on Annual Income and Spending score", fontsize=15,fontweight="bold")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.show()
It’s quite illustrative that the clusters in sky blue, orange and light green contain the observations with high spending scores, whereas sky blue, beige, orange and purple contain the highest incomes. Still, I don’t find these clusters well segmented, so what I will do next is find the right number of clusters, using a measure called “Inertia” and a technique called the “Elbow” method.
Inertia
Inertia is simply a measure of how well a data set is clustered. Technically, it is the sum of the squared distances between each object and the centroid of the cluster it is assigned to (inertia = Σ ||x_i − μ_c(i)||²). The more clusters, the lower the inertia, but there’s a tradeoff: we want a specific number of clusters that group the observations evenly by behavior, so we will find K manually with the Elbow method.
Elbow Method
The elbow method suggests that the optimal number of clusters is the point where the inertia starts decreasing slowly; here’s how I applied it:
inertia_scores = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(a)                              # note: this fits on all numeric columns of "a"
    inertia_scores.append(kmeans.inertia_)
plt.plot(range(1, 11), inertia_scores)
According to the curve, the point where the rate of decrease in inertia slows down noticeably is around 5, so I will go with 5 clusters.
Here’s how the clusters look after setting K to 5, and we can still confirm whether we have chosen the right number of clusters with a validating measure called the Silhouette Score.
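The re-fit with K = 5 isn't shown above; a minimal sketch of that step, reusing the same single feature, is:
clustering1 = KMeans(n_clusters=5)
clustering1.fit(a[['Annual_Income_inThousandUSD']])
clusters['cluster_pred'] = clustering1.labels_   # overwrite the earlier 8-cluster labels
plt.scatter(clusters['Annual_Income_inThousandUSD'], clusters['Spending_Score'],
            c=clusters['cluster_pred'], cmap='rainbow')
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.show()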
Silhouette Score
The Silhouette Score measures how similar each object is to its own cluster compared to the other clusters; it ranges from -1 to 1, and the closer it is to 1, the better separated (and the more valid) our clusters are.
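A minimal sketch of how the score can be computed for the income-based clusters (assuming the 5-cluster fit above):
score1 = silhouette_score(a[['Annual_Income_inThousandUSD']], clustering1.labels_)
print(score1)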
I feel that’s a satisfying score, and now that we have validated the choice, let’s have a look at the average values within the 5 clusters.
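A sketch of that per-cluster summary:
clusters.groupby('cluster_pred').mean()   # average Age, Annual Income and Spending Score per cluster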
Bivariate Clustering
Moving on from univariate clustering, I will now cluster on “Annual Income” together with “Spending Score”; here’s how I did it:
clustering2 = KMeans(5)                       # K = 5, justified by the elbow curve below
clustering2.fit(a[['Annual_Income_inThousandUSD', 'Spending_Score']])
clusters["cluster_pred2"] = clustering2.labels_
clusters.head()
I set K to 5 because I had already checked the inertia, and it shows even more clearly that beyond 5 the inertia curve barely changes.
inertia_score2 = []
for i in range(1, 11):
    kmeans2 = KMeans(n_clusters=i)
    kmeans2.fit(a[['Annual_Income_inThousandUSD', 'Spending_Score']])
    inertia_score2.append(kmeans2.inertia_)
plt.plot(range(1, 11), inertia_score2)
Exploring Centroids
To confirm that I am moving in the right direction, I checked another property of the clusters, their centroids; here’s how I did it.
centers = pd.DataFrame(clustering2.cluster_centers_)
centers.columns = ['x', 'y']        # x = Annual Income, y = Spending Score
plt.figure(figsize=(8, 6))
plt.scatter(x=centers['x'], y=centers['y'], s=50, c='black', marker='*')   # mark the 5 centroids
sns.scatterplot(data=clusters, x='Annual_Income_inThousandUSD', y='Spending_Score', hue='cluster_pred2', palette='tab10')
Now that the clustering is done properly, let’s look at some numbers for the newly created clusters and the percentage split between genders.
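A minimal sketch of that gender split per cluster (the exact table in the post may differ):
pd.crosstab(clusters['cluster_pred2'], df['Gender'], normalize='index') * 100   # % of each gender within a cluster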
Multivariate Clustering
We can also do multivariate clustering; it is typically used when more than 2 variables affect how the clusters are formed. To do it smoothly, I first standardized and scaled the data.
Feature Scaling
Most of the time, the data we deal with has features measured in different units, and each feature has its own distribution of values. This creates complications for the algorithm, such as being biased toward the feature with larger values and variances. To avoid this, feature scaling is used to bring the data onto a common scale, and there are two major techniques for it: standardization and normalization.
Standardization
Standardization is a technique where we scale the data to fit a standard normal distribution: we shift each attribute so that its mean becomes zero and rescale it so that its standard deviation becomes one, i.e. z = (x − mean) / std.
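A tiny worked example of what that means in code (the numbers here are made up for illustration):
x = np.array([15.0, 54.0, 61.0, 137.0])         # toy income values
z = (x - x.mean()) / x.std()                    # z = (x - mean) / std
print(round(z.mean(), 10), round(z.std(), 10))  # ~0.0 and 1.0 after standardization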
For further explanation refer to this article:
https://www.apexon.com/blog/feature-scaling-for-ml-standardization-vs-normalization/
Moving on with multivariate clustering, here’s how I scaled and fit the data:
from sklearn.preprocessing import StandardScaler
scale= StandardScaler()
For the multivariate clustering, I added the “Gender” attribute back by transforming it into a binary value: “1” means male and “0” means female. You can name the column whatever is comfortable for you. For convenience, “Income_cluster” here is the earlier “cluster_pred” and “Income_nd_SpendingScore_cluster” is “cluster_pred2”.
Then I prepared a new copy of the data frame, “dff”, trimming off the two previously found cluster columns so that I standardize the attributes of the original data set rather than the cluster labels.
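The construction of "dff" itself isn't shown here; a plausible sketch, where "Income_cluster" and "Income_nd_SpendingScore_cluster" are the assumed column names mentioned above, is:
dff = df.drop('CustomerID', axis=1).copy()
dff['Gender'] = dff['Gender'].map({'Male': 1, 'Female': 0})            # binary-encode Gender
dff['Income_cluster'] = clustering1.labels_                            # same labels as cluster_pred
dff['Income_nd_SpendingScore_cluster'] = clustering2.labels_           # same labels as cluster_pred2
# drop the two cluster columns again so only the original attributes get standardized
dff = dff.drop(['Income_cluster', 'Income_nd_SpendingScore_cluster'], axis=1)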
# standardize all remaining columns and keep them in a labeled DataFrame
dff = pd.DataFrame(scale.fit_transform(dff), columns=dff.columns)
dff.head()
The values now look nicely standardized, so let’s move on to checking the inertia and, finally, the multivariate clustering.
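The elbow check for the scaled frame isn't shown above; a minimal sketch of it, following the same pattern as before, is:
inertia_score3 = []
for i in range(1, 11):
    kmeans3 = KMeans(n_clusters=i)
    kmeans3.fit(dff)
    inertia_score3.append(kmeans3.inertia_)
plt.plot(range(1, 11), inertia_score3)
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()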
clustering3 = KMeans(5)                       # K = 5 again, based on the elbow check
clustering3.fit(dff)
dff["multi_cluster"] = clustering3.labels_
dff.head()
dff["multi_cluster"].value_counts()
Summary/Findings
So, we are done with the multivariate clustering, and with the standardization and scaling behind us, here is a summary of our findings from segmenting the customers through clustering:
There are two clusters with high annual income, but only one of them also has a high spending score. The one with the high spending score should be targeted with advertising, while the one with the lower spending score should be studied further using behavioral data.
Cluster “0” has a low income value but a high spending score, so there might be a specific sort of product that these customers are buying.
I feel there is nothing more to discover here, so I will conclude this blog at this point.
You can find the source code here!
Let me know how you found this process and whether there is anything else you discovered from this data or analysis.