Constructing a similarity graph from a set $X$ of data points in $\mathbb{R}^d$ is the first step of many modern clustering algorithms. However, typical constructions of a similarity graph have high time complexity, and their space usage grows quadratically in $|X|$. We address this limitation with a new algorithmic framework that constructs a sparse approximation of the fully connected similarity graph while preserving its cluster structure. Our algorithm is based on the kernel density estimation problem and applies to arbitrary kernel functions. We compare our algorithm with the well-known implementations from the scikit-learn and FAISS libraries, and find that our method significantly outperforms the implementations from both libraries on a variety of datasets.
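To make the quadratic cost concrete, the following is a minimal sketch of the dense baseline, not of the sparse construction described above. It assumes a Gaussian kernel with a hypothetical bandwidth `sigma`, and the function names are illustrative only. It also shows the link to kernel density estimation: the weighted degree of vertex $x_i$ in the fully connected graph is the sum $\sum_{j \neq i} k(x_i, x_j)$, which is an unnormalized kernel density estimate at $x_i$.

```python
import numpy as np


def gaussian_similarity_matrix(X, sigma=1.0):
    """Dense similarity graph: W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Stores all n^2 entries, which is the quadratic space cost in |X|.
    The Gaussian kernel and sigma are assumptions made for illustration.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative rounding errors
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def kernel_density_sums(X, sigma=1.0):
    """Unnormalized KDE at each point: sum over j != i of k(x_i, x_j).

    Computed one row at a time, so only O(n) extra memory is needed,
    although the running time of this naive version is still quadratic.
    """
    n = X.shape[0]
    sums = np.empty(n)
    for i in range(n):
        sq_dists = np.sum((X - X[i]) ** 2, axis=1)
        sums[i] = np.exp(-sq_dists / (2.0 * sigma ** 2)).sum() - 1.0  # drop k(x_i, x_i) = 1
    return sums


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic points, for illustration only

W = gaussian_similarity_matrix(X)
degrees = W.sum(axis=1)                # weighted vertex degrees of the dense graph
kde_sums = kernel_density_sums(X)      # the same quantities via kernel sums
```

Here `degrees` and `kde_sums` agree up to floating-point error, which is why fast approximate kernel density estimation is a natural primitive for working with the fully connected graph without materializing it.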