Improved Algorithms for Clustering with Noisy Distance Oracles

Bateni et al. have recently introduced the weak-strong distance oracle model to study clustering problems in settings with limited distance information. Given query access to the strong and weak oracles of this model, they design approximation algorithms for the $k$-means and $k$-center clustering problems. In this work, we design algorithms with improved guarantees for both problems in the weak-strong oracle model. The $k$-means++ algorithm is routinely used to solve $k$-means in settings where complete distance information is available. One of the main contributions of this work is to show that the $k$-means++ algorithm can be adapted to the weak-strong oracle model using only a small number of strong-oracle queries, which are the critical resource in this model. In particular, our $k$-means++ based algorithm gives a constant-factor approximation for $k$-means using $O(k^2 \log^2 n)$ strong-oracle queries. This improves on the algorithm of Bateni et al., which uses $O(k^2 \log^4 n \log^2 \log n)$ strong-oracle queries for a constant-factor approximation of $k$-means. For the $k$-center problem, we give a simple ball-carving based $6(1 + ε)$-approximation algorithm that uses $O(k^3 \log^2 n \log \frac{\log n}{ε})$ strong-oracle queries. This improves the approximation factor of the $14(1 + ε)$-approximation algorithm of Bateni et al., which uses $O(k^2 \log^4 n \log^2 \frac{\log n}{ε})$ strong-oracle queries. To show the effectiveness of our algorithms, we perform empirical evaluations on real-world datasets and show that our algorithms significantly outperform those of Bateni et al.
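For readers unfamiliar with the baseline, the following is a minimal sketch of the classical $k$-means++ seeding step ($D^2$ sampling) in the standard setting where every distance is available exactly. It is not the oracle-based adaptation described above: it makes no use of weak or strong oracles, and the function names `sq_dist`, `kmeans_pp_seeding`, and `kmeans_cost` are illustrative assumptions rather than anything from the paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeding(points, k, rng=None):
    """Classical k-means++ seeding with exact distances (illustrative only).

    The first center is chosen uniformly at random. Each later center is
    sampled with probability proportional to the squared distance from the
    point to its nearest already-chosen center (D^2 sampling).
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; pick arbitrarily.
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # Update nearest-center distances using only the newest center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (-10.0, 5.0)]
    seeds = kmeans_pp_seeding(data, k=3, rng=random.Random(1))
    print(seeds, kmeans_cost(data, seeds))
```

Every call to `sq_dist` here corresponds to an exact distance evaluation, so this sketch performs on the order of $nk$ of them. In the weak-strong oracle model such exact evaluations are the scarce strong-oracle queries, which is why a direct use of this procedure is too expensive and an adaptation is needed.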
