Spectral clustering has become a popular choice for data clustering because of its ability to uncover clusters of different shapes. However, its computational demands mean it is not always preferable to other clustering methods. One effective way to reduce these demands is to perform spectral clustering on a subset of points (data representatives) and then generalize the clustering outcome to the full dataset; this is known as approximate spectral clustering (ASC). ASC uses sampling or quantization to select data representatives, which makes it vulnerable to 1) inconsistent performance (since these methods have a random step in either initialization or training) and 2) loss of local statistics (because the pairwise similarities are computed from data representatives instead of data points). We propose a refined version of the $k$-nearest neighbor graph in which we keep all data points and aggressively reduce the number of edges for computational efficiency. Local statistics are used to keep the edges that do not violate the intra-cluster distances and to nullify all other edges in the $k$-nearest neighbor graph. We also introduce an optional step that automatically selects the number of clusters $C$. The proposed method was tested on synthetic and real datasets. Compared to ASC methods, it delivered consistent performance despite a significant reduction in the number of edges.
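The edge-reduction idea can be illustrated with a minimal sketch. The abstract does not state the exact pruning rule, so the one used here is an assumption chosen only for illustration: an edge is kept if its length is at most the mean plus one standard deviation of the $k$-nearest-neighbor distances of both of its endpoints. The function names `knn_graph` and `refine_knn_graph` are hypothetical, and the sketch uses a brute-force neighbor search that would not scale to large datasets.

```python
import math
import statistics


def knn_graph(points, k):
    """Return {i: [(j, dist), ...]} listing the k nearest neighbors of each point."""
    graph = {}
    for i, p in enumerate(points):
        # Brute-force search: distance from point i to every other point.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [(j, d) for d, j in dists[:k]]
    return graph


def refine_knn_graph(graph):
    """Prune k-NN edges using per-point local statistics.

    Assumed rule (not taken from the source): keep edge (i, j) only if its
    length is at most mean + std of the k-NN distances of BOTH endpoints.
    Returns the surviving undirected edges as a set of (i, j) with i < j.
    """
    threshold = {}
    for i, nbrs in graph.items():
        d = [dist for _, dist in nbrs]
        threshold[i] = statistics.fmean(d) + statistics.pstdev(d)

    edges = set()
    for i, nbrs in graph.items():
        for j, dist in nbrs:
            if dist <= threshold[i] and dist <= threshold[j]:
                edges.add((min(i, j), max(i, j)))
    return edges


# Two well-separated groups of three points each. With k = 3 every point
# has one neighbor in the other group, so the raw graph has cross-group edges.
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
graph = knn_graph(points, k=3)
edges = refine_knn_graph(graph)
print(sorted(edges))
# → [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
```

In this toy example all data points remain in the graph and only the long cross-group edges are removed, so the affinity matrix passed to spectral clustering would be sparser without resorting to data representatives. A different threshold (for example, a different multiple of the standard deviation) would change which edges survive.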