Neighborhood graph construction is a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k \geq 1$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method performs better in the low-$k$ regime, where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.
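The construction rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `incremental_knn_edges` and `is_connected`, the use of Euclidean distance, and the choice of list order as insertion order are all assumptions made for illustration (cosine distance would be an equally plausible choice for text embeddings). The brute-force neighbor search is quadratic and only suitable for small examples.

```python
import math
from collections import deque


def incremental_knn_edges(points, k):
    """Link each point to its k nearest previously inserted points.

    Assumptions (not from the paper): insertion order is the list order
    and distance is Euclidean.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    edges = set()
    for i in range(1, len(points)):
        # Only earlier points (indices < i) are candidate neighbors.
        nearest = sorted(range(i), key=lambda j: math.dist(points[i], points[j]))
        for j in nearest[:k]:
            edges.add((j, i))  # undirected edge, stored as (earlier, later)
    return edges


def is_connected(n, edges):
    """Breadth-first search from node 0 over the undirected edge set."""
    if n == 0:
        return True
    adjacency = {v: [] for v in range(n)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n


# Two well-separated pairs of points, inserted in list order with k = 1.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
edges = incremental_knn_edges(points, k=1)
connected = is_connected(len(points), edges)
```

In this toy example a standard 1-NN graph would pair the points within each group and leave the two groups disconnected, whereas the incremental rule forces the third point to attach to an earlier one. The same observation is the core of an inductive argument: every node after the first gains at least one edge to the already connected graph built so far.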