The first step is to install the fuzzy-c-means package using pip or conda:
pip install fuzzy-c-means
conda install fuzzy-c-means
Before starting the clustering, we need to read the data and briefly explore it to check whether there are any missing values.
import pandas as pd
# Load the crime data
url = 'https://raw.githubusercontent.com/ITE-5th/fuzzy-clustering/master/data/crime_data.csv'
data = pd.read_csv(url)
data.head()
|   |            | crime$cluster | Murder | Assault | UrbanPop | Rape |
|---|------------|---------------|--------|---------|----------|------|
| 0 | Alabama    | 4             | 13.2   | 236     | 58       | 21.2 |
| 1 | Alaska     | 4             | 10.0   | 263     | 48       | 44.5 |
| 2 | Arizona    | 4             | 8.1    | 294     | 80       | 31.0 |
| 3 | Arkansas   | 3             | 8.8    | 190     | 50       | 19.5 |
| 4 | California | 4             | 9.0    | 276     | 91       | 40.6 |
This dataset includes a 'crime$cluster' column that already assigns each state to one of 4 groups. We will cluster the data ourselves using the FCM algorithm.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0                  50 non-null     object
 1   crime$cluster  50 non-null     int64
 2   Murder         50 non-null     float64
 3   Assault        50 non-null     int64
 4   UrbanPop       50 non-null     int64
 5   Rape           50 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 2.5+ KB
The output shows that there are no missing values in the data. Different missing-value detection methods are covered in a previous article.
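For example, the absence of missing values can also be confirmed explicitly with a one-liner (a minimal sketch):
# Count missing values per column; all zeros means the data is complete
data.isnull().sum()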
data.describe()
|       | crime$cluster | Murder   | Assault    | UrbanPop  | Rape      |
|-------|---------------|----------|------------|-----------|-----------|
| count | 50.000000     | 50.00000 | 50.000000  | 50.000000 | 50.000000 |
| mean  | 2.720000      | 7.78800  | 170.760000 | 65.540000 | 21.232000 |
| std   | 1.125584      | 4.35551  | 83.337661  | 14.474763 | 9.366385  |
| min   | 1.000000      | 0.80000  | 45.000000  | 32.000000 | 7.300000  |
| 25%   | 2.000000      | 4.07500  | 109.000000 | 54.500000 | 15.075000 |
| 50%   | 3.000000      | 7.25000  | 159.000000 | 66.000000 | 20.100000 |
| 75%   | 4.000000      | 11.25000 | 249.000000 | 77.750000 | 26.175000 |
| max   | 4.000000      | 17.40000 | 337.000000 | 91.000000 | 46.000000 |
import matplotlib.pyplot as plt
# set the plotting style to 'ggplot'
plt.style.use('ggplot')
data.plot.box()
<Axes: >
The box plots show that there are no severe extreme values or outliers in the data.
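As a quick numerical sanity check, we can also count values that fall outside the usual 1.5 × IQR fences for each numeric column (a minimal sketch; this check is an extra illustration, not part of the original workflow):
# Count values falling outside the 1.5 * IQR fences per numeric column
num = data.select_dtypes('number')
q1, q3 = num.quantile(0.25), num.quantile(0.75)
iqr = q3 - q1
((num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)).sum()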
cols = ['Murder', 'Assault','Rape']
features = data[cols]
features.head()
|   | Murder | Assault | Rape |
|---|--------|---------|------|
| 0 | 13.2   | 236     | 21.2 |
| 1 | 10.0   | 263     | 44.5 |
| 2 | 8.1    | 294     | 31.0 |
| 3 | 8.8    | 190     | 19.5 |
| 4 | 9.0    | 276     | 40.6 |
Before we can apply FCM, we need to scale the features to have a mean of 0 and a standard deviation of 1. This ensures that all features contribute equally to the distance calculations used in clustering.
from sklearn.preprocessing import StandardScaler
# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(features)
X
array([[ 1.25517927, 0.79078716, -0.00345116], [ 0.51301858, 1.11805959, 2.50942392], [ 0.07236067, 1.49381682, 1.05346626], [ 0.23470832, 0.23321191, -0.18679398], [ 0.28109336, 1.2756352 , 2.08881393], [ 0.02597562, 0.40290872, 1.88390137], [-1.04088037, -0.73648418, -1.09272319], [-0.43787481, 0.81502956, -0.58583422], [ 1.76541475, 1.99078607, 1.1505301 ], [ 2.22926518, 0.48775713, 0.49265293], [-0.57702994, -1.51224105, -0.11129987], [-1.20322802, -0.61527217, -0.75839217], [ 0.60578867, 0.94836277, 0.29852525], [-0.13637203, -0.70012057, -0.0250209 ], [-1.29599811, -1.39102904, -1.07115345], [-0.41468229, -0.67587817, -0.34856705], [ 0.44344101, -0.74860538, -0.53190987], [ 1.76541475, 0.94836277, 0.10439756], [-1.31919063, -1.06375661, -1.44862395], [ 0.81452136, 1.56654403, 0.70835037], [-0.78576263, -0.26375734, -0.53190987], [ 1.00006153, 1.02108998, 1.49564599], [-1.1800355 , -1.19708982, -0.68289807], [ 1.9277624 , 1.06957478, -0.44563089], [ 0.28109336, 0.0877575 , 0.75148985], [-0.41468229, -0.74860538, -0.521125 ], [-0.80895515, -0.83345379, -0.51034012], [ 1.02325405, 0.98472638, 2.671197 ], [-1.31919063, -1.37890783, -1.26528114], [-0.08998698, -0.14254532, -0.26228808], [ 0.83771388, 1.38472601, 1.17209984], [ 0.76813632, 1.00896878, 0.52500755], [ 1.20879423, 2.01502847, -0.55347961], [-1.62069341, -1.52436225, -1.50254831], [-0.11317951, -0.61527217, 0.01811858], [-0.27552716, -0.23951493, -0.13286962], [-0.66980002, -0.14254532, 0.87012344], [-0.34510472, -0.78496898, -0.68289807], [-1.01768785, 0.03927269, -1.39469959], [ 1.53348953, 1.3119988 , 0.13675217], [-0.92491776, -1.027393 , -0.90938037], [ 1.25517927, 0.20896951, 0.61128652], [ 1.13921666, 0.36654512, 0.46029832], [-1.06407289, -0.61527217, 0.17989166], [-1.29599811, -1.48799864, -1.08193832], [ 0.16513075, -0.17890893, -0.05737552], [-0.87853272, -0.31224214, 0.53579242], [-0.48425985, -1.08799901, -1.28685088], [-1.20322802, -1.42739264, -1.1250778 ], [-0.22914211, -0.11830292, -0.60740397]])
Now we can apply FCM to the preprocessed data. Before fitting the model, we can use a scatter plot to roughly estimate the number of clusters.
plt.scatter(X[:,0], X[:,1])
plt.xlabel('Murder')
plt.ylabel('Assault')
Text(0, 0.5, 'Assault')
From the scatter plot, we can roughly see two large clusters, which could be further divided into 3, 4, or even more clusters. We will divide the data into 3 clusters.
from fcmeans import FCM
# Define the number of clusters
n_clusters = 3
# Initialize the FCM algorithm
fcm = FCM(n_clusters=n_clusters)
# Fit the data to the FCM algorithm
fcm.fit(X)
# Get the cluster centers and the membership matrix
centroids = fcm.centers
membership_mat = fcm.u
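Unlike hard clustering, the membership matrix fcm.u stores, for every data point, its degree of membership in each cluster, and the memberships of each point sum to 1. A quick way to inspect this (a minimal sketch):
# One row per data point, one column per cluster
print(membership_mat.shape)            # (50, 3)
# The memberships of each point across the 3 clusters sum to 1
print(membership_mat[:3].round(3))
print(membership_mat.sum(axis=1)[:5])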
Finally, we can assign each data point to the cluster with the highest membership:
import numpy as np
# Assign each data point to the cluster with the highest membership
labels = np.argmax(membership_mat, axis=1)
labels
array([2, 2, 2, 1, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 2, 1, 1, 0, 2, 0, 1, 2, 2, 2, 0, 1, 1, 1, 0, 0, 2, 0, 2, 2, 1, 0, 1, 1, 0, 0, 1], dtype=int64)
# or use predict() directly
labels = fcm.predict(X)
labels
array([2, 2, 2, 1, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 0, 1, 1, 2, 0, 2, 1, 2, 0, 2, 1, 1, 0, 2, 0, 1, 2, 2, 2, 0, 1, 1, 1, 0, 0, 2, 0, 2, 2, 1, 0, 1, 1, 0, 0, 1], dtype=int64)
Clustering validation is an important step in any clustering analysis as it helps to evaluate the quality of the clustering results. One way to validate the clustering results is by using the Partition Coefficient (PC) and Partition Entropy Coefficient (PEC) measures.
The Partition Coefficient (PC) measures how crisp, i.e., how close to a hard partition, the fuzzy memberships are. It is defined as the average of the squared membership values:
PC = (1/N) * sum(i=1 to N) sum(c=1 to C) (u_ic)^2
where N is the total number of data points, C is the number of clusters, and u_ic is the degree of membership of data point i in cluster c.
The Partition Entropy Coefficient (PEC) measures the fuzziness of the partition. It is defined as the negative average of the membership values weighted by their logarithms:
PEC = -(1/N) * sum(i=1 to N) sum(c=1 to C) u_ic * log2(u_ic)
where N, C, and u_ic are defined as above.
PC ranges from 1/C to 1 and PEC ranges from 0 to log2(C). A PC value close to 1 and a PEC value close to 0 both indicate a nearly crisp partition with compact, well-separated clusters, while a lower PC and a higher PEC indicate a fuzzier, more overlapping partition.
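To make the definitions concrete, both measures can be computed directly from the membership matrix with a few lines of NumPy (a minimal sketch; the results should match the package's built-in properties used below, possibly up to the choice of logarithm base):
# Compute PC and PEC by hand from the membership matrix of the 3-cluster model
N = membership_mat.shape[0]
pc_manual = np.sum(membership_mat ** 2) / N
pec_manual = -np.sum(membership_mat * np.log2(membership_mat)) / N
print(pc_manual, pec_manual)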
We can visualize how the values of PC and PEC change with different numbers of clusters to validate clustering results. First, let's create models with 2, 3, 4, 5, 6 and 7 centers.
n_clusters_list = [2, 3, 4, 5, 6, 7]
models = list()
for n_clusters in n_clusters_list:
    fcm = FCM(n_clusters=n_clusters)
    fcm.fit(X)
    models.append(fcm)
Then, we can easily calculate the values of PC and PEC using the partition_coefficient and partition_entropy_coefficient properties of the fuzzy-c-means package.
# lay the models out on a near-square grid of subplots
num_clusters = len(n_clusters_list)
rows = int(np.ceil(np.sqrt(num_clusters)))
cols = int(np.ceil(num_clusters / rows))
f, axes = plt.subplots(rows, cols, figsize=(11,12))
for n_clusters, model, axe in zip(n_clusters_list, models, axes.ravel()):
    # get validation metrics
    pc = model.partition_coefficient
    pec = model.partition_entropy_coefficient
    fcm_centers = model.centers
    fcm_labels = model.predict(X)
    # plot result
    axe.scatter(X[:,0], X[:,1], c=fcm_labels, alpha=.9)
    axe.scatter(fcm_centers[:,0], fcm_centers[:,1], marker="+", s=200, c='r')
    axe.set_title(f'n_clusters = {n_clusters}, PC = {pc:.3f}, PEC = {pec:.3f}')
plt.show()
The plots show that the model with 2 clusters obtains the highest partition coefficient (PC) and the lowest partition entropy coefficient (PEC) for this example, and the models with 3 and 4 clusters also achieve comparatively high PC and low PEC values.
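If you prefer to read the metrics directly rather than from the subplot titles, a small summary loop works as well (a minimal sketch):
# Print PC and PEC for each fitted model
for n_clusters, model in zip(n_clusters_list, models):
    print(f'n_clusters = {n_clusters}: '
          f'PC = {model.partition_coefficient:.3f}, '
          f'PEC = {model.partition_entropy_coefficient:.3f}')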
We can visualize the results of the clustering algorithm by plotting the data points with different colors based on their assigned cluster.
In the following example, we plot a scatter plot between 'Murder' and 'Assault' with the centroids of each cluster.
# Define the colors for each cluster
colors = ['b', 'g', 'r', 'c', 'm']
n_clusters=3
# Plot the data points
for i in range(n_clusters):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], c=colors[i])
# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=100, c='#050505')
# Set the axis labels
plt.xlabel('Murder')
plt.ylabel('Assault')
# Show the plot
plt.show()
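Because FCM produces soft assignments, we can also make the fuzziness visible by scaling each point's marker size with its highest membership degree. This is an extra illustration on top of the original workflow (a minimal sketch reusing the variables defined above):
# Marker size reflects how strongly each point belongs to its assigned cluster:
# small markers sit near cluster boundaries, large markers near the centers
strength = membership_mat.max(axis=1)
for i in range(n_clusters):
    mask = labels == i
    plt.scatter(X[mask, 0], X[mask, 1], c=colors[i], s=200 * strength[mask])
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=100, c='#050505')
plt.xlabel('Murder')
plt.ylabel('Assault')
plt.show()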
results = features.copy()
results['Labels'] = labels
results.head()
|   | Murder | Assault | Rape | Labels |
|---|--------|---------|------|--------|
| 0 | 13.2   | 236     | 21.2 | 2      |
| 1 | 10.0   | 263     | 44.5 | 2      |
| 2 | 8.1    | 294     | 31.0 | 2      |
| 3 | 8.8    | 190     | 19.5 | 1      |
| 4 | 9.0    | 276     | 40.6 | 2      |
import seaborn as sns
# Create a scatterplot matrix
sns.pairplot(results, hue='Labels', palette='Dark2')
<seaborn.axisgrid.PairGrid at 0x1f3f64f4700>
Fuzzy C-Means clustering is a powerful unsupervised machine learning technique that can be used to group data points with similar characteristics. It is particularly useful when the boundaries between clusters are not well-defined, or when a data point could belong to multiple clusters with different degrees of membership.
In this tutorial, we used the crime data to demonstrate how Fuzzy C-Means clustering can be easily implemented in Python using the fuzzy-c-means package. We loaded the data, preprocessed it, and applied the FCM algorithm to cluster the observations based on their 'Murder', 'Assault', and 'Rape' values. Finally, we visualized the clustering results by plotting the data points with different colors based on their assigned clusters.
Overall, Fuzzy C-Means clustering is a useful tool for data analysis, and it can be applied to a wide range of real-world problems, such as customer segmentation, image segmentation, and pattern recognition.