High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered by color, weight, or size, each revealing a different interpretation of the given dataset. A variety of approaches can identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and the number of clusters in each subspace. Choosing these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that uses the Minimum Description Length (MDL) principle to detect the number of subspaces and the number of clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, we introduce an encoding strategy that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive with state-of-the-art methods.
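To make the role of MDL in choosing the number of clusters concrete, the following minimal sketch scores candidate cluster counts by a two-part code length and keeps the cheapest one. It is not the encoding proposed in the paper: it assumes one-dimensional data, plain k-means, a BIC-style parameter cost, and a Gaussian code for the points, and it omits the subspace search, the split and merge steps, and the outlier encoding. All names (`kmeans_1d`, `description_length`, `select_k`, `make_data`) and the variance floor `min_var` are hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def kmeans_1d(xs, k, max_iter=100):
    """Plain 1-D k-means with deterministic quantile initialisation."""
    s = sorted(xs)
    n = len(s)
    centers = [s[int((i + 0.5) * n / k)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
    return labels


def description_length(xs, labels, k, min_var=1e-2):
    """Two-part code length in bits (illustrative, not the paper's encoding)."""
    n = len(xs)
    # Model cost: a mean and a variance per cluster, 0.5 * log2(n) bits each.
    bits = 0.5 * (2 * k) * math.log2(n)
    for c in range(k):
        members = [x for x, lab in zip(xs, labels) if lab == c]
        m = len(members)
        if m == 0:
            continue
        mean = sum(members) / m
        ss = sum((x - mean) ** 2 for x in members)
        var = max(ss / m, min_var)  # assumed floor to avoid log(0)
        # Cost of the cluster ids under a Shannon code on cluster sizes.
        bits += m * -math.log2(m / n)
        # Cost of the points: negative log2-likelihood of a Gaussian.
        bits += 0.5 * m * math.log2(2 * math.pi * var)
        bits += ss / (2 * var) * math.log2(math.e)
    return bits


def select_k(xs, k_max=6):
    """Return the cluster count with the shortest description, plus all costs."""
    costs = {}
    for k in range(1, k_max + 1):
        costs[k] = description_length(xs, kmeans_1d(xs, k), k)
    return min(costs, key=costs.get), costs


def make_data(centers=(0.0, 10.0, 20.0), per_cluster=50):
    """Synthetic toy data: Gaussian quantiles placed around each center."""
    nd = NormalDist()
    offsets = [nd.inv_cdf((i + 0.5) / per_cluster) for i in range(per_cluster)]
    return [c + o for c in centers for o in offsets]


best_k, costs = select_k(make_data())
print("selected k:", best_k)
```

On this toy data, which contains three well-separated groups, the shortest description is obtained with three clusters. Merging groups inflates the cost of encoding the points, while splitting a group adds more bits for cluster ids and parameters than it saves. The same trade-off is what lets an MDL criterion replace user-supplied cluster counts.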