We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space together with a label for each data point, supplied by an oracle, indicating which subsets of points should be clustered together. This setting captures situations where we have access to auxiliary information about the data set that is relevant to our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha$ fraction of false positives and false negatives, where $\alpha \in (0,c)$ for some $c<1$; in the absence of these errors, the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a data set of size $m$, we propose a deterministic $k$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized algorithm, while preserving the $O(d m \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our bound holds for $\alpha$ up to $1/2$, an improvement over the requirement that $\alpha$ be at most $1/7$ in the previous work. For the $k$-medians problem, we improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, obtaining a cost of $(1+O(\alpha))\mathrm{OPT}$ while requiring essentially just $O(md \log^3 m/\alpha)$ runtime.
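To make the input model concrete, the following minimal sketch shows the naive baseline that the learning-augmented setting starts from: take the predicted labels at face value, use the mean of each predicted cluster as its center, and evaluate the $k$-means cost. This is not the algorithm proposed here; it only illustrates how label errors can inflate the cost. The function names `centers_from_labels` and `kmeans_cost` and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def centers_from_labels(points: Sequence[Point],
                        labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Naive baseline (hypothetical helper): coordinate-wise mean of each predicted cluster."""
    groups: Dict[Hashable, List[Point]] = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: [sum(coord) / len(members) for coord in zip(*members)]
        for label, members in groups.items()
    }


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-means cost: sum of squared Euclidean distances to the nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(point, center)) for center in centers)
        for point in points
    )


# Toy data (assumed): two well-separated pairs of points in the plane.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
clean_labels = [0, 0, 1, 1]   # error-free oracle labels
noisy_labels = [0, 0, 0, 1]   # one point of the second cluster is mislabeled

clean_cost = kmeans_cost(points, list(centers_from_labels(points, clean_labels).values()))
noisy_cost = kmeans_cost(points, list(centers_from_labels(points, noisy_labels).values()))
```

On this toy input a single mislabeled point pulls one center away from its cluster, so `noisy_cost` exceeds `clean_cost`. Limiting this kind of degradation as a function of $\alpha$ is what the cost bounds above address.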