In today's data-driven digital era, both the volume and the complexity of collected data (multi-view, non-Euclidean, and multi-relational, for example) are growing exponentially or even faster. Clustering, which extracts valid knowledge from data without supervision, is extremely useful in practice. However, existing methods are each developed to handle one particular challenge at the expense of the others. In this work, we propose a simple but effective framework for complex data clustering (CDC) that can efficiently process different types of data with linear complexity. We first use graph filtering to fuse geometric structure and attribute information. We then reduce the complexity with high-quality anchors that are adaptively learned via a novel similarity-preserving regularizer. We demonstrate the clusterability of the proposed method both theoretically and experimentally. In particular, we deploy CDC on graph data with 111M nodes.
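The two steps named above, graph filtering followed by anchor-based complexity reduction, can be illustrated with a minimal sketch. It is not the CDC implementation. It assumes a common low-pass filter, (I - L/2)^k with L the symmetric normalized Laplacian, and it swaps the adaptively learned anchors and similarity-preserving regularizer for randomly sampled anchors with a plain ridge penalty. The function names and the parameters `order`, `num_anchors`, and `alpha` are hypothetical. Dense matrices are used only for brevity; a sparse adjacency matrix would be needed to keep the cost linear in the number of nodes.

```python
import numpy as np


def low_pass_filter(adj, features, order=2):
    """Smooth node features with (I - L/2)^order.

    L is the symmetric normalized Laplacian of the graph with self-loops
    added (an assumption of this sketch, not taken from the abstract).
    """
    n = adj.shape[0]
    a = adj + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    smoothed = np.asarray(features, dtype=float)
    for _ in range(order):
        # I - L/2 equals (I + a_norm) / 2, so one pass averages each
        # node's features with its normalized neighborhood.
        smoothed = 0.5 * (smoothed + a_norm @ smoothed)
    return smoothed


def anchor_similarity(smoothed, num_anchors, alpha=1.0, seed=0):
    """Return an (n, m) node-to-anchor similarity matrix Z.

    Z minimizes ||X - Z B||_F^2 + alpha * ||Z||_F^2, where B holds m
    anchors. Anchors here are randomly sampled rows of X (a stand-in
    for learned anchors), so only an m x m system is solved.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(smoothed.shape[0], size=num_anchors, replace=False)
    anchors = smoothed[idx]                                  # shape (m, d)
    gram = anchors @ anchors.T + alpha * np.eye(num_anchors)  # shape (m, m)
    # Closed form: Z = X B^T (B B^T + alpha I)^{-1}
    return np.linalg.solve(gram, anchors @ smoothed.T).T


# Toy graph: two triangles joined by a single edge, with random attributes.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
features = np.random.default_rng(1).normal(size=(6, 4))

smoothed = low_pass_filter(adj, features, order=2)
z = anchor_similarity(smoothed, num_anchors=3)
```

In a full pipeline, a spectral embedding would typically be taken from the left singular vectors of `z` and passed to k-means; that step is omitted here.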