Metric Clustering and MST with Strong and Weak Distance Oracles

We study optimization problems in a metric space $(\mathcal{X},d)$ where distances can be computed in two ways: via a "strong" oracle that returns exact distances $d(x,y)$, and via a "weak" oracle that returns distances $\tilde{d}(x,y)$ which may be arbitrarily corrupted with some probability. This model captures the increasingly common trade-off between an expensive similarity model (e.g. a large-scale embedding model) and a cheaper but less accurate one. The goal is therefore to make as few queries to the strong oracle as possible. We consider both so-called "point queries", where the strong oracle is queried on a set of points $S \subset \mathcal{X}$ and returns $d(x,y)$ for all $x,y \in S$, and "edge queries", where it is queried for individual distances $d(x,y)$. Our main contributions are optimal algorithms and lower bounds for clustering and Minimum Spanning Tree (MST) in this model. For $k$-centers, $k$-median, and $k$-means, we give constant-factor approximation algorithms that use only $\tilde{O}(k)$ strong oracle point queries, and we prove that $\Omega(k)$ queries are required for any bounded approximation. For edge queries, our upper and lower bounds are both $\tilde{\Theta}(k^2)$. Surprisingly, for the MST problem we give an $O(\sqrt{\log n})$-approximation algorithm that uses no strong oracle queries at all, together with a matching $\Omega(\sqrt{\log n})$ lower bound. We empirically evaluate our algorithms and show that their quality is comparable to that of baseline algorithms that are given all true distances, while querying the strong oracle on only a small fraction ($<1\%$) of the points.
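To make the query model concrete, the following is a minimal Python sketch of the two oracles and the two kinds of strong query. It is an illustration under stated assumptions, not the authors' implementation: the class name `TwoOracleMetric`, the Euclidean toy metric, the parameter `corrupt_prob`, the choice to fix each corrupted weak answer once, and the convention of charging a point query by the number of distinct points it touches are all hypothetical choices made here.

```python
import math
import random


class TwoOracleMetric:
    """Toy strong/weak distance oracles over points in Euclidean space.

    Assumptions (not from the source): the metric is Euclidean, each weak
    distance is corrupted independently with probability `corrupt_prob`,
    and weak answers are fixed once so repeated queries agree.
    """

    def __init__(self, points, corrupt_prob=0.3, seed=0):
        self.points = list(points)
        rng = random.Random(seed)
        self._weak = {}
        n = len(self.points)
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(self.points[i], self.points[j])
                if rng.random() < corrupt_prob:
                    # "Arbitrary" corruption: replace d by an unrelated value.
                    d = rng.uniform(0.0, 10.0 * (d + 1.0))
                self._weak[(i, j)] = d
        self._points_seen = set()  # distinct points sent to the strong oracle
        self.edge_queries = 0      # number of strong edge queries made

    def weak(self, i, j):
        """Weak oracle: free to call, but possibly corrupted."""
        if i == j:
            return 0.0
        return self._weak[(min(i, j), max(i, j))]

    def strong_edge(self, i, j):
        """Strong edge query: one exact distance per call."""
        self.edge_queries += 1
        return math.dist(self.points[i], self.points[j])

    def strong_points(self, subset):
        """Strong point query on S: exact d(x, y) for all x, y in S."""
        s = sorted(set(subset))
        self._points_seen.update(s)
        return {
            (i, j): math.dist(self.points[i], self.points[j])
            for a, i in enumerate(s)
            for j in s[a + 1:]
        }

    @property
    def points_queried(self):
        """Point-query cost, counted here as distinct points queried."""
        return len(self._points_seen)
```

Under this accounting, a point query on a set of $m$ points reveals all $m(m-1)/2$ exact pairwise distances at a cost of $m$, whereas each edge query reveals a single distance. For example, `TwoOracleMetric(pts).strong_points([0, 1, 2])` returns three exact distances while `points_queried` increases by three.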
