Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be combined with many distances, similarities, and linkage strategies. It is often used when the number of clusters in the data set is unknown and some form of hierarchy in the data is plausible. Most HAC algorithms operate on a full distance matrix and therefore require memory quadratic in the number of data points. The standard algorithm also needs cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic on embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well-known BIRCH data aggregation algorithm, can make HAC viable on systems with constrained resources with only small losses in clustering quality, and hence allow exploratory data analysis of very large data sets.
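The aggregate-then-cluster idea can be illustrated with a minimal sketch. It is not the BETULA implementation: it assumes a flat list of summaries instead of a CF-tree, a plain distance threshold for absorbing points, and a naive centroid-linkage HAC on the summaries. The names `ClusterFeature`, `aggregate`, `hac_centroid`, and the `threshold` parameter are hypothetical. Each summary stores a count, a mean, and a sum of squared deviations, updated incrementally rather than via raw sums of squares, which is the kind of numerically stable representation the text attributes to BETULA.

```python
import math


class ClusterFeature:
    """Summary of a group of points: count, mean, sum of squared deviations."""

    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.ssd = 0.0  # sum of squared deviations from the mean

    def add(self, point):
        # Welford-style update: avoids subtracting two large sums of squares.
        self.n += 1
        for i, x in enumerate(point):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.ssd += delta * (x - self.mean[i])


def aggregate(points, threshold):
    """Absorb each point into the nearest summary if close enough (assumed rule)."""
    features = []
    for p in points:
        best, best_d = None, math.inf
        for cf in features:
            d = math.dist(cf.mean, p)
            if d < best_d:
                best, best_d = cf, d
        if best is not None and best_d <= threshold:
            best.add(p)
        else:
            features.append(ClusterFeature(p))
    return features


def hac_centroid(features, k):
    """Naive centroid-linkage HAC on the summaries until k clusters remain."""
    clusters = [(cf.n, list(cf.mean)) for cf in features]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (na, ma), (nb, mb) = clusters[i], clusters[j]
        n = na + nb
        # Weighted mean, so each summary counts as many points as it absorbed.
        merged = (n, [(na * a + nb * b) / n for a, b in zip(ma, mb)])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


def pair_count(m):
    """Number of entries in a condensed pairwise distance matrix."""
    return m * (m - 1) // 2


# Toy data: two well-separated 20 x 20 grids of points (800 points in total).
grid = [(0.1 * x, 0.1 * y) for x in range(20) for y in range(20)]
points = grid + [(px + 10.0, py + 10.0) for px, py in grid]

features = aggregate(points, threshold=0.5)
clusters = sorted(hac_centroid(features, k=2), key=lambda c: c[1][0])

print("points:", len(points), "pairwise distances:", pair_count(len(points)))
print("summaries:", len(features), "pairwise distances:", pair_count(len(features)))
print("clusters (size, mean):", clusters)
```

The quadratic distance matrix is built over the summaries only, so its size depends on the number of summaries rather than on the number of original points. In this sketch, a larger `threshold` yields fewer summaries and a coarser approximation of the hierarchy.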