Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances, an unrealistic requirement in many modern applications. We study clustering in the \emph{Rank-model (R-model)}, where access to distances is entirely replaced by a \emph{quadruplet oracle} that provides only relative distance comparisons. In practice, such an oracle can represent a learned model or human feedback, and is expected to be noisy and to incur an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n\cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces and improves to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. In the bounded doubling dimension setting, we further improve the approximation factor from a constant to $1+\varepsilon$, for any constant $\varepsilon\in(0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms.
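To make the access model concrete, the following is a minimal Python sketch of a noisy quadruplet oracle and a naive way to map items to a given set of centers using comparisons only. It is an illustration under stated assumptions, not the algorithm of this work: the function names (`make_noisy_quadruplet_oracle`, `closer_center`, `assign_to_centers`), the `error_prob` and `repeats` parameters, and the noise model (each query independently returns the wrong answer with a fixed probability) are all hypothetical choices made for the example. In particular, the majority vote over repeated queries only helps if errors are independent across repetitions, which the text above does not state.

```python
import math
import random


def make_noisy_quadruplet_oracle(dist, error_prob, rng):
    """Build oracle(a, b, c, d) that reports whether d(a, b) <= d(c, d).

    ASSUMPTION (illustrative only): every query is answered incorrectly
    independently with probability `error_prob`.
    """
    def oracle(a, b, c, d):
        truth = dist(a, b) <= dist(c, d)
        # Flip the true answer with probability error_prob.
        return (not truth) if rng.random() < error_prob else truth
    return oracle


def closer_center(oracle, x, c1, c2, repeats=1):
    """Return whichever of c1, c2 the oracle judges closer to x.

    Uses a majority vote over `repeats` queries; ties go to c1.
    """
    votes_for_c1 = sum(oracle(x, c1, x, c2) for _ in range(repeats))
    return c1 if 2 * votes_for_c1 >= repeats else c2


def assign_to_centers(oracle, items, centers, repeats=1):
    """Map each item to a center via a linear tournament of comparisons.

    Issues len(items) * (len(centers) - 1) * repeats oracle queries and
    never reads a numeric distance.
    """
    assignment = {}
    for x in items:
        best = centers[0]
        for c in centers[1:]:
            best = closer_center(oracle, x, best, c, repeats)
        assignment[x] = best
    return assignment
```

For instance, calling `assign_to_centers(oracle, items, centers, repeats=5)` with an oracle built from `math.dist` and `random.Random(0)` returns a dictionary from items to centers. Note that the linear tournament can propagate a single wrong comparison, so this sketch carries no approximation guarantee.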