What does HCA stand for? What is the difference between Agglomerative and Divisive? When do I use the algorithm and what are its strengths? This article answers all of these questions.
If you don’t know what clustering means, check out this article. Here we also explain four other clustering methods that you as a data scientist must know.
What is an HCA?
Hierarchical Cluster Analysis, or HCA, is a technique for grouping objects into compact clusters based on empirical similarity measures. In its most common form, the two most similar objects or clusters are merged step by step until all objects finally end up in one cluster. The result is a tree-like structure.
So how does a hierarchical cluster procedure work?
Agglomerative vs Divisive Calculation
Clustering can proceed in two opposite directions: agglomerative and divisive.
Agglomerative Nesting, abbreviated AGNES, is also known as the bottom-up method. Each object starts as its own cluster. The method then merges the two most similar clusters, and repeats this step until all the data is enclosed in a single cluster.
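The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes one-dimensional data and uses the shortest distance between members as the cluster distance. The names `single_link` and `agglomerate` and the sample values are made up for this example.

```python
from itertools import combinations

def single_link(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Merge the two closest clusters until only one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest distance
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, single_link(a, b)))
        # replace the two clusters by their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a + b]
    return merges

# Hypothetical sample data
merges = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0])
for a, b, d in merges:
    print(a, "+", b, "merged at distance", d)
```

The merge history is exactly the information a tree representation needs: which clusters were joined, and at what distance. The brute-force search over all pairs keeps the sketch short but becomes slow for large data sets.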
The divisive cluster calculation follows an opposite concept.
Divisive hierarchical clustering:
Divisive Analysis, also known as DIANA, is a top-down method. All objects start out in a single cluster, which is then split step by step into smaller clusters.
In the following figure, the agglomerative process is compared with the divisive process.
In both cases, the goal is to summarize the common properties of multidimensional raw data in a compact, low-dimensional structure. A strength of this machine learning method is that it also captures the relationships between clusters. K-means, by contrast, only ensures that objects within a cluster are similar to each other and dissimilar to objects in other clusters. If you want to know more about this other popular clustering method, read this article.
How to calculate the cluster distances?
As mentioned earlier, not only the similarities between data points within a cluster are taken into account, but also the similarities between groups. These similarities are represented by distances between the clusters, which can be determined in different ways. One option is the distance between the centroids of two clusters. The other common options compare points pairwise: single linkage is the shortest distance between any point in one cluster and any point in the other, complete linkage is the largest such distance, and average linkage is the average over all such pairs.
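The four distance definitions can be written down directly. The following sketch assumes Euclidean distance and clusters given as lists of coordinate tuples. The function names and the two sample clusters are illustrative only, and `math.dist` requires Python 3.8 or later.

```python
from itertools import product
from math import dist
from statistics import mean

def single_linkage(a, b):
    """Shortest distance between a point in a and a point in b."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Largest distance between a point in a and a point in b."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Mean distance over all pairs of points from a and b."""
    return mean(dist(p, q) for p, q in product(a, b))

def centroid_distance(a, b):
    """Distance between the mean points (centroids) of a and b."""
    def centroid(cluster):
        return tuple(mean(axis) for axis in zip(*cluster))
    return dist(centroid(a), centroid(b))

# Two hypothetical clusters in the plane
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (3.0, 2.0)]

for measure in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(measure.__name__, round(measure(left, right), 3))
```

For any pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance. Which measure fits best depends on the expected cluster shapes.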
The figure below contrasts each cluster distance calculation method.
In addition to the planar representation, the HCA can also be represented in a dendrogram.
HCA represented in a Dendrogram
Since an HCA describes a tree structure, it lends itself to being drawn as a dendrogram. The dendrogram makes the connections between the individual data elements and the connections between the clusters clearly visible. It can also help to choose a suitable number of clusters, which depends on the height at which you cut the tree.
The following figure shows such a dendrogram for both the agglomerative and the divisive calculation.
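Cutting the tree at a given height can be illustrated with a short sketch. It assumes the merge history is available as `(i, j, distance)` triples, where `i` and `j` are the indices of one point from each of the two merged clusters, and that merge distances never decrease along the history (true for single, complete and average linkage). The function `cut_tree` and the merge history below are hypothetical.

```python
def cut_tree(n_points, merges, height):
    """Return one cluster label per point, keeping only merges at or below `height`."""
    parent = list(range(n_points))  # union-find forest, one tree per cluster

    def find(i):
        # follow parent links to the cluster representative
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, d in merges:
        if d <= height:  # merges above the cut are ignored
            parent[find(i)] = find(j)

    roots = [find(i) for i in range(n_points)]
    # renumber representatives as 0, 1, 2, ... in order of first appearance
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Hypothetical merge history for five points
merges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 4.0), (3, 4, 8.0)]

labels_low = cut_tree(5, merges, height=2.0)   # cut below the merge at 4.0
labels_high = cut_tree(5, merges, height=5.0)  # cut between 4.0 and 8.0
print(labels_low, labels_high)
```

A low cut yields many small clusters, a high cut yields few large ones. Libraries such as SciPy provide comparable functionality, for example `scipy.cluster.hierarchy.fcluster` with `criterion='distance'`.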