Clustering with Fuzzy C-Means
Imagine, you may have collected data about your customers, such as age, height, body weight, earlobe area, egg consumption. The data definitely was collected with the consent of your users and now you want to figure out if your users can be grouped into a certain number of groups, so you can provide better service to each of them. Most likely, some users won’t fall into exactly one group, so you’d also like to quantify each user’s partial membership in each group.
One common algorithm to cluster such data is K-means Clustering.
However, it will assign each data point to exactly one group.
Very similar is the Fuzzy C-Means algorithm, where you have a hyper-parameter, determining how sharp the group boundaries will be.
There are various methods to find out, which number of clusters may best describe your data but let’s say, you already know how many clusters there are in your customer base, e.g. k
clusters.
The algorithm is rather simple:
- To each data point
x
, initializek
values of cluster affiliations randomly - Repeat the following steps until you reach the set limit of iterations or the change in cluster affilitations falls below a set error threshold
- Calculate the centroid of each cluster by calculating the weighted sum of each datapoint, using the affilitation to the current cluster as weight.
- Update each datapoint’s affilitation to the cluster by calculating their distance to the cluster centroid
The algorithm will work with any n-dimensional data but in 2D, it’s very easy to visualize. Below you can set the parameters of the algorithm. Push the datapoints around with your mouse to form clusters and click the button to run the algorithm. It will colorize them according to their cluster membership.
You may quickly realize that the algorithm has an obvious weakness: If a cluster is stretched very long, some of its points may get assigned to a different cluster, even though there’s a clear gap between them. This is because the algorithm bases on the distances to the cluster centers but doesn’t take into account the individual distances of datapoints to each other.
The code can be found here.
Quite a while ago, I also wrote a Python version with standalone GUI, which you can find here.