As a data scientist, you have several basic tools at your disposal, which you can also apply in combination to a data set. Here we present some clustering algorithms that you should definitely know and use
In times of Big Data, not only the sheer number of data increases, but also the relationships between them. More and more complex dependencies are formed. This makes it all the more difficult to recognize these similar properties and to assign the data to so-called clusters in a way that can be evaluated.
You have certainly heard of these algorithms and maybe used one or the other, but do you really know what clustering algorithms are?
What are clustering algorithms?
So let’s first clarify what these algorithms are in the first place. The goal is clear: You want to identify similar properties between individual data points in a data set and group them in a meaningful way. These properties are often high-dimensional.
With the help of cluster analysis, you want to reduce this high-dimensional information to a low-dimensional dependency. So, for example, a representation in 2D space. Clustering is an unsupervised machine learning technique and in the end you classify the data points by using algorithms.
The approach to clustering differs from technique to technique. All have their advantages and disadvantages, so it makes sense to try several on one set of data, or apply them in combination. Below we will introduce you to some popular clustering methods and explain their grouping approach.
The first algorithm we want to introduce you to is Mean-Shift Clustering. With this you can find dense areas of data points according to the concept of kernel density estimation (KDE). The basis of the clustering is a circular sliding window, which moves towards higher density at each iteration. Within the window, the centers of each class are determined, called centroids.
The movement is now created by moving the center to the average of the points within the window. The density within the sliding window is thus proportional to the number of points within it. This motion continues until there is no direction in which the motion can take more points within the kernel.
Hierarchical Cluster Analysis (HCA)
With HCA, clusters are formed based on empirical similarity measures of the data points. This means that the two most similar objects are assigned one after the other until all objects are in one cluster. This results in a tree-like structure. In contrast to the K-means algorithm, which we will discuss later, similarities between the clusters play a role. These are represented by a cluster distance. With K-means, only all objects within a collection are similar to each other, while they are dissimilar to objects in other clusters.
You can create an HCA in different ways. There are two elementary procedures, the top-down and the bottom-up. If you want to know more about Hierarchical Cluster Analysis, read this article.
Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
GMM basically assumes that the data points are Gaussian and not circular. The clusters are described by their mean and standard deviation. Each Gaussian distribution is randomly assigned to a single cluster and found using the Expectation-Maximization (EM) optimization algorithm. The probability of belonging to a cluster is then calculated for each data point. Thus, the closer a point is to the Gaussian center, the more likely it is then to belong to that cluster. Based on these probabilities, a new set of parameters for the Gaussian distributions is iteratively calculated. That is, the probabilities within a cluster are maximized.
K-Means clustering algorithms
The k-Means algorithm described by MacQueen, 1967 goes back to the methods described by Lloyd, 1957 and Forgy, 1965. You can use the algorithm besides cluster analysis also for vector quantization. Here, a data set is partitioned into k groups with equal variance.
The number of clusters must be specified in advance. Each disjoint cluster is described by the average of all contained samples. The so-called cluster centroid.
Each centroid is updated to represent the average of its constituent instances. This is done until the assignment of instances to the clusters does not
changes any more. If you want to learn more about the K-means algorithm, check this out.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is a density-based cluster analysis with noise. From an arbitrary starting data point, neighborhood points are specified at a distance epsilon. Clustering then begins from a certain neighborhood data point count.
The current data point becomes the first point of the new cluster, or referred to as noise. In both cases, however, it is considered to be examined. The neighboring data points are then added to the cluster. Once all neighbors have been added, a new, unexamined point is called and processed. A new cluster is thus formed.
The field of cluster algorithms is wide and everyone’s approach is different. You should be aware that there is no one solution. You have to consider each algorithm as another tool. Not every technique works equally well in every situation.
The key here is to always understand the basic approach of each algorithm you want to use. Build a small portfolio and get to know these techniques well. Once you master them, you should then add new ones. Knowing your own tools is crucial to avoid try and error and to gain control over your data. Remember: no result is a result. Your added value here is that even if an algorithm doesn’t work well on your data set, it will give you information about the data properties.