Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important clustering algorithms, appealing both for its theoretical soundness and for its adaptability to many real-world data settings. For example, mixed data, which combines numerical and categorical features, is typically handled via numerical discretization, dummy coding, or similarity functions that account for both data types. This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm, avoiding the need for data preprocessing or sophisticated similarity functions. We propose adding extra nodes to the graph, one for each category the data may belong to, and show that this construction leads to an interpretable clustering objective function. Furthermore, we demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data. Finally, we compare our algorithms against related methods and show that they provide a competitive alternative in terms of both clustering performance and runtime.
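The idea of adding category nodes to the graph can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the paper's actual construction: the function names `augmented_affinity` and `spectral_bipartition`, the Gaussian kernel with bandwidth `sigma`, the uniform data-to-category edge weight `lam`, and the two-cluster split by the sign of the Fiedler vector are all choices made here for brevity.

```python
import numpy as np

def augmented_affinity(X_num, X_cat, sigma=1.0, lam=1.0):
    """Affinity matrix over n data nodes plus one extra node per
    (categorical feature, category) pair. Hypothetical helper."""
    n = X_num.shape[0]
    # Data-to-data edges: Gaussian similarity on numerical features (assumed kernel).
    sq_dist = np.sum((X_num[:, None, :] - X_num[None, :, :]) ** 2, axis=-1)
    W_num = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W_num, 0.0)
    # Data-to-category edges: weight lam if the object takes that category.
    cols = []
    for j in range(X_cat.shape[1]):
        for value in np.unique(X_cat[:, j]):
            cols.append((X_cat[:, j] == value).astype(float))
    B = lam * np.column_stack(cols)
    m = B.shape[1]
    # Category nodes are not connected to each other in this sketch.
    W = np.zeros((n + m, n + m))
    W[:n, :n] = W_num
    W[:n, n:] = B
    W[n:, :n] = B.T
    return W

def spectral_bipartition(W, n_data):
    """Two-way split of the data nodes using the second eigenvector
    of the symmetric normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Rescale to the random-walk eigenvector before thresholding at zero.
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    return (fiedler[:n_data] > 0).astype(int)

# Toy mixed data: one numerical feature, one categorical feature.
X_num = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
X_cat = np.array([["a"], ["a"], ["a"], ["b"], ["b"], ["b"]])
W = augmented_affinity(X_num, X_cat)
labels = spectral_bipartition(W, n_data=X_num.shape[0])
print(labels)
```

Only the rows belonging to data nodes are used for the final assignment; the category nodes act as hubs that pull together objects sharing a category. For more than two clusters, one would typically take several eigenvectors and run k-means on the data-node rows instead of thresholding a single vector.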