import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import scipy.cluster.hierarchy as shc
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
In this article, we use another well-known customer segmentation dataset, which contains more variables than the one used in the previous article on K-means clustering. You can download it from this link; we only need the Train.csv file. After downloading the dataset, create a data folder in your current working directory and put the file inside it. Then let's use pandas to read the data.
df = pd.read_csv('./data/Train.csv', index_col='ID')
df.head()
| ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|
| 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
In this section, the main task is to preprocess the data: selecting features, handling missing values, and encoding the categorical/string variables.
df.columns
Index(['Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession', 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1', 'Segmentation'], dtype='object')
We drop the 'Var_1' and 'Segmentation' columns because they are not helpful for our analysis.
df_select = df.drop(['Var_1','Segmentation'], axis=1)
df_select.head()
| ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size |
|---|---|---|---|---|---|---|---|---|
| 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 |
| 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 |
| 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 |
| 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 |
| 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 |
df_select.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8068 entries, 462809 to 461879
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           8068 non-null   object
 1   Ever_Married     7928 non-null   object
 2   Age              8068 non-null   int64
 3   Graduated        7990 non-null   object
 4   Profession       7944 non-null   object
 5   Work_Experience  7239 non-null   float64
 6   Spending_Score   8068 non-null   object
 7   Family_Size      7733 non-null   float64
dtypes: float64(2), int64(1), object(5)
memory usage: 567.3+ KB
The info output shows that several variables have missing values. Let's check what percentage of each column is missing.
pd.DataFrame({'missing':df_select.isnull().sum(),
'percentage':(df_select.isnull().sum() / np.shape(df_select)[0]) * 100})
| | missing | percentage |
|---|---|---|
| Gender | 0 | 0.000000 |
| Ever_Married | 140 | 1.735250 |
| Age | 0 | 0.000000 |
| Graduated | 78 | 0.966782 |
| Profession | 124 | 1.536936 |
| Work_Experience | 829 | 10.275161 |
| Spending_Score | 0 | 0.000000 |
| Family_Size | 335 | 4.152206 |
The results show that the largest share of missing values in any column is only about 10% (Work_Experience), so we can simply drop the rows that contain missing values.
df_select = df_select.dropna()
df_select.isnull().sum()
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
dtype: int64
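As an aside, dropping rows is not the only option: the missing values could instead be imputed. Below is a minimal sketch using scikit-learn's SimpleImputer, applied to the pre-drop DataFrame; the names df_alt, num_alt and obj_alt are just illustrative, and this variant is not used in the rest of this walkthrough.
from sklearn.impute import SimpleImputer

# alternative to dropna(): fill numerical columns with the median
# and categorical columns with the most frequent value
df_alt = df.drop(['Var_1', 'Segmentation'], axis=1)
num_alt = df_alt.select_dtypes(include=['number']).columns
obj_alt = df_alt.select_dtypes(exclude=['number']).columns
df_alt[num_alt] = SimpleImputer(strategy='median').fit_transform(df_alt[num_alt])
df_alt[obj_alt] = SimpleImputer(strategy='most_frequent').fit_transform(df_alt[obj_alt])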
We have several categorical/string variables, which need to be encoded as numbers. If you are not familiar with categorical variable encoding, please refer to this previous article. In this example, we use OneHotEncoder() from scikit-learn, so that each category becomes its own 0/1 column and can be analyzed separately.
# select the categorical columns
cat_cols = df_select.select_dtypes(exclude=["number"]).columns
cat_cols
Index(['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score'], dtype='object')
# onehotencode the categorical features
encoder = preprocessing.OneHotEncoder(sparse_output=False)
trans = encoder.fit_transform(df_select[cat_cols])
# obtain the encoded column names
enc_columns = encoder.get_feature_names_out(cat_cols)
# convert the encoded feature array into a DataFrame
features_enc = pd.DataFrame(trans, columns=enc_columns)
# add the encoded features to the selected feature DataFrame
df_enc = pd.concat([df_select,features_enc.set_index(df_select.index)],axis=1)
# drop the original unencoded categorical columns
df_enc.drop(cat_cols,axis=1, inplace=True)
df_enc.head()
| ID | Age | Work_Experience | Family_Size | Gender_Female | Gender_Male | Ever_Married_No | Ever_Married_Yes | Graduated_No | Graduated_Yes | Profession_Artist | ... | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing | Spending_Score_Average | Spending_Score_High | Spending_Score_Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 462809 | 22 | 1.0 | 4.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 466315 | 67 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 461735 | 67 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 461319 | 56 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 460156 | 32 | 1.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 21 columns
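For reference, pandas can produce a roughly equivalent encoded frame in one line with get_dummies; we stick with OneHotEncoder above because a fitted encoder can later be applied to new data. Note that, depending on the pandas version, the dummy columns may be bool or uint8 rather than float.
# roughly equivalent shortcut with pandas
df_enc_alt = pd.get_dummies(df_select, columns=list(cat_cols))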
X = df_enc.values
X
array([[22.,  1.,  4., ...,  0.,  0.,  1.],
       [67.,  1.,  1., ...,  0.,  0.,  1.],
       [67.,  0.,  2., ...,  0.,  1.,  0.],
       ...,
       [33.,  1.,  1., ...,  0.,  0.,  1.],
       [27.,  1.,  4., ...,  0.,  0.,  1.],
       [37.,  0.,  3., ...,  1.,  0.,  0.]])
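One caveat worth noting: Ward linkage works on Euclidean distances, so the Age column, whose values are far larger than the 0/1 encoded columns, will dominate those distances. We keep the unscaled data in this example, but a standardized version could be built as below if you want to compare the two.
from sklearn.preprocessing import StandardScaler

# optional: standardize all features so that Age does not dominate the Euclidean distances
X_scaled = StandardScaler().fit_transform(df_enc)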
A dendrogram is a tree-like diagram that shows the hierarchical relationships between clusters in a hierarchical clustering algorithm. The dendrogram function is a visualization tool provided by the scipy.cluster.hierarchy
module in Python for displaying the hierarchical clustering results in the form of a dendrogram.
The dendrogram function takes a linkage matrix as input, which records which clusters were merged at each step of the clustering process and at what distance. The function then plots the dendrogram, where the height of each node in the tree represents the distance between the two clusters merged at that level.
The dendrogram function also provides several optional parameters to customize the appearance of the plot, such as the orientation of the tree, the color of the branches, and the font size of the labels.
The dendrogram function is a useful tool for visualizing hierarchical clustering results and gaining insight into the structure of the data. It can also help in choosing the number of clusters for subsequent analysis, by identifying a level in the tree where clusters only merge at comparatively large distances, so that cutting there leaves compact, well-separated groups.
First, we need to compute the linkage matrix using the linkage function from the scipy.cluster.hierarchy library. The linkage function merges clusters step by step, measuring the distance between clusters according to the specified linkage criterion. There are several linkage criteria available, including "ward", "complete", "average", and "single"; we use "ward" in this example.
linkage_matrix = shc.linkage(X, 'ward')
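Each row of the resulting linkage matrix records one merge: the indices of the two clusters being merged, the distance between them, and the number of original samples in the newly formed cluster. Inspecting the last few rows gives a feel for the distances near the top of the tree.
# each row: [cluster index 1, cluster index 2, merge distance, size of the merged cluster]
print(linkage_matrix.shape)   # (n_samples - 1, 4)
print(linkage_matrix[-5:])    # the five final (largest-distance) merges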
After calculating the linkage matrix, we can plot the dendrogram using the dendrogram function from the same library. The dendrogram shows the hierarchical relationships between the data points and the clusters. In the plot, we draw three horizontal threshold lines to mark possible choices for the number of clusters.
plt.figure(figsize=(10, 7))
shc.dendrogram(linkage_matrix)
plt.title("Customers Dendrogram")
plt.axhline(y=1000, color='r', linestyle="--")
plt.axhline(y=600, color='m', linestyle="--")
plt.axhline(y=400, color='y', linestyle="--")
plt.show()
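With several thousand customers, the bottom of the full dendrogram is very dense. If you only care about the top-level structure, the dendrogram function can be truncated to show just the last merges, for example:
# show only the last 20 merged clusters instead of every single leaf
plt.figure(figsize=(10, 7))
shc.dendrogram(linkage_matrix, truncate_mode='lastp', p=20)
plt.title("Customers Dendrogram (truncated)")
plt.show()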
The dendrogram gives us a reference for choosing the number of clusters. Reading from top to bottom, cutting at the highest line gives 2 clusters as a first option, and the lower cuts give 3 and 4 clusters as the second and third options, and so on.
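The same threshold lines can also be turned directly into flat cluster labels with scipy's fcluster function, which cuts the tree at a given distance. For example, cutting at the magenta line at y=600:
# cut the tree at distance 600 and return flat (1-based) cluster labels
labels_cut = shc.fcluster(linkage_matrix, t=600, criterion='distance')
print(np.unique(labels_cut, return_counts=True))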
To implement hierarchical clustering with scikit-learn, we need to follow a few steps as follows.
First, we need to define the hierarchical clustering model. In scikit-learn, we can use the AgglomerativeClustering
class to define the model.
We can specify the number of clusters we want to create using the n_clusters
parameter. Two clusters seem too coarse, so we choose 3 in this example. We can also specify the linkage criterion to use for merging the clusters via the linkage
parameter; the linkage criterion determines how the distance between clusters is measured.
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
After defining the model, we can fit it to the data using the fit method and obtain the predicted labels for each data point using the labels_ attribute.
model.fit(X)
labels = model.labels_
labels
array([0, 2, 2, ..., 0, 0, 0], dtype=int64)
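Equivalently, the fit_predict method fits the model and returns the labels in a single call:
# shorthand: fit the model and return the cluster labels in one step
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)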
Finally, we can evaluate the clustering performance using metrics such as silhouette score, which measures the similarity between the data points within a cluster compared to the data points in other clusters.
To check how well 3 clusters performs, we compare the silhouette scores for 2 to 10 clusters.
score=[]
range_n_clusters = range(2, 11)
for num_clusters in range_n_clusters:
    # initialise hierarchical clustering
    hcluster = AgglomerativeClustering(n_clusters=num_clusters, linkage='ward')
    hcluster.fit(X)
    cluster_labels = hcluster.labels_
    # silhouette score
    silhouette_avg = silhouette_score(X, cluster_labels)
    score.append(silhouette_avg)
    print("For n_clusters={0}, the silhouette score is {1:.2f}".format(num_clusters, silhouette_avg))
For n_clusters=2, the silhouette score is 0.53
For n_clusters=3, the silhouette score is 0.48
For n_clusters=4, the silhouette score is 0.40
For n_clusters=5, the silhouette score is 0.39
For n_clusters=6, the silhouette score is 0.36
For n_clusters=7, the silhouette score is 0.36
For n_clusters=8, the silhouette score is 0.37
For n_clusters=9, the silhouette score is 0.38
For n_clusters=10, the silhouette score is 0.38
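Since we collected the scores in a list, a quick line plot makes the comparison easier to read:
# plot the silhouette score against the number of clusters
plt.figure(figsize=(8, 5))
plt.plot(list(range_n_clusters), score, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()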
The comparison shows that the silhouette score for 3 clusters (0.48) is close to the highest value, which supports 3 clusters as an effective choice.
Here we present the results in tables and plots.
First, we create a result table by adding the cluster labels to the data as a new column.
results = df_enc.copy()
results['Clusters'] = labels
results.head()
| ID | Age | Work_Experience | Family_Size | Gender_Female | Gender_Male | Ever_Married_No | Ever_Married_Yes | Graduated_No | Graduated_Yes | Profession_Artist | ... | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing | Spending_Score_Average | Spending_Score_High | Spending_Score_Low | Clusters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 462809 | 22 | 1.0 | 4.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| 466315 | 67 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 |
| 461735 | 67 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2 |
| 461319 | 56 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 |
| 460156 | 32 | 1.0 | 3.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
5 rows × 22 columns
In the previous article on K-means clustering, we used several visualization methods to display the results. In this example, many of the features are one-hot encoded variables that only take the values 0 and 1, so a scatter plot is not a good way to display the results. Let's use a parallel coordinates plot instead.
plt.figure(figsize=(15,8))
pd.plotting.parallel_coordinates(results,'Clusters',alpha=0.90,sort_labels=True)
plt.xticks(rotation=80)
plt.show()
The results clearly show that cluster 0 contains much younger people, while cluster 2 contains older people. Because the Age values are much larger than the 0/1 encoded variables, the other variables are hard to read on this scale, so let's drop Age and plot again.
results_drop_age = results.drop(['Age'],axis=1)
plt.figure(figsize=(15,8))
pd.plotting.parallel_coordinates(results_drop_age,'Clusters',alpha=0.90,sort_labels=True)
plt.xticks(rotation=80)
plt.show()
Now we can read more information from the plot. For example, cluster 0 tends to have more work experience and larger family sizes.
We can also analyze the results through summary statistics for each cluster. In this example, we average the numerical variables 'Age', 'Work_Experience' and 'Family_Size', and count how many members of each cluster fall into each encoded category.
# numerical column names
num_cols = results.iloc[:, 0:3].columns
# encoded categorical column names
cat_cols = results.iloc[:, 3:-1].columns
# means of the numerical columns by cluster
results_mean = results.groupby('Clusters')[num_cols].mean()
# counts of each encoded category by cluster
results_count = results.groupby('Clusters')[cat_cols].apply(lambda x: (x == 1).sum())
# combine the results into one DataFrame
results_cluster = pd.concat([results_mean, results_count], axis=1)
results_count
| Clusters | Gender_Female | Gender_Male | Ever_Married_No | Ever_Married_Yes | Graduated_No | Graduated_Yes | Profession_Artist | Profession_Doctor | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing | Spending_Score_Average | Spending_Score_High | Spending_Score_Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1819 | 2075 | 2437 | 1457 | 1740 | 2154 | 1029 | 450 | 370 | 493 | 180 | 1059 | 133 | 8 | 172 | 718 | 252 | 2924 |
| 1 | 873 | 1195 | 272 | 1796 | 421 | 1647 | 1047 | 124 | 192 | 278 | 234 | 28 | 40 | 75 | 50 | 875 | 391 | 802 |
| 2 | 324 | 432 | 34 | 722 | 279 | 477 | 135 | 20 | 24 | 44 | 95 | 2 | 5 | 420 | 11 | 84 | 369 | 303 |
# move the 'Clusters' index back to a regular column
results_cluster.reset_index(inplace=True)
results_cluster
| | Clusters | Age | Work_Experience | Family_Size | Gender_Female | Gender_Male | Ever_Married_No | Ever_Married_Yes | Graduated_No | Graduated_Yes | ... | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing | Spending_Score_Average | Spending_Score_High | Spending_Score_Low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 32.002825 | 3.254237 | 3.048279 | 1819 | 2075 | 2437 | 1457 | 1740 | 2154 | ... | 370 | 493 | 180 | 1059 | 133 | 8 | 172 | 718 | 252 | 2924 |
| 1 | 1 | 53.324468 | 1.977273 | 2.754836 | 873 | 1195 | 272 | 1796 | 421 | 1647 | ... | 192 | 278 | 234 | 28 | 40 | 75 | 50 | 875 | 391 | 802 |
| 2 | 2 | 76.060847 | 1.197090 | 2.015873 | 324 | 432 | 34 | 722 | 279 | 477 | ... | 24 | 44 | 95 | 2 | 5 | 420 | 11 | 84 | 369 | 303 |
3 rows × 22 columns
These results are very helpful for getting insight into each group. For example, cluster 0 contains more females than the other two clusters, although males outnumber females in all three clusters.
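Because the encoded columns only contain 0s and 1s, taking the mean within each cluster turns the counts into proportions, which makes statements like the one above easy to verify. For example, the share of females per cluster:
# mean of a 0/1 column within each cluster = proportion of that category
print(results.groupby('Clusters')['Gender_Female'].mean())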
plt.figure(figsize=(15,8))
pd.plotting.parallel_coordinates(results_cluster,'Clusters',alpha=0.9,sort_labels=True)
plt.xticks(rotation=80)
plt.show()
The plot of the table above gives a more visual view of the results, which might help us dig further into the data. Since the focus of this article is on methods rather than analysis, we will not discuss further insights here.
Clustering is a powerful unsupervised machine learning technique that can help us find structure in our data without any prior knowledge of the underlying patterns. In this tutorial, we have explored how to implement hierarchical clustering using Python's scikit-learn library and how to visualize the results using dendrograms and parallel coordinates plots.
We started by loading the dataset, selecting features, detecting and handling missing values, and encoding the categorical variables. Next, we calculated the linkage matrix using the ward linkage criterion and created a dendrogram to help choose the number of clusters. Then we implemented hierarchical clustering with scikit-learn by defining the model with the AgglomerativeClustering
class, fitting it to the data, and evaluating the results with the silhouette score. Finally, we presented the results through summary statistics and a parallel coordinates plot.
It is worth noting that hierarchical clustering can be computationally expensive for large datasets, and the choice of linkage criterion can have a significant impact on the results. Therefore, it is important to carefully select an appropriate linkage criterion and the number of clusters (or distance threshold) to obtain the best results.
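If you want to check this sensitivity on the present data, a quick comparison of the available linkage criteria at 3 clusters could look like the following sketch.
# compare silhouette scores of different linkage criteria at 3 clusters
for link in ['ward', 'complete', 'average', 'single']:
    lbls = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(f"linkage={link}: silhouette score = {silhouette_score(X, lbls):.2f}")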
Overall, hierarchical clustering is a powerful technique that can help us gain insights into our data and identify meaningful clusters. By visualizing the results with dendrograms and parallel coordinates plots, we can better understand the underlying structure of our data and make more informed decisions.