Clustering is a basic task in data analysis and machine learning, and optimizing clustering objectives is a well-studied class of optimization problems; among these, the $k$-Means objective is arguably the best known. Given a collection of points in a metric space, the goal is to partition them into $k$ clusters, each with an associated center, so as to minimize the sum of squared distances of points to their cluster centers. In this paper, we present a polynomial-time $(3+2\sqrt{2}+ε)$-approximation algorithm for $k$-Means in general metrics, where $3+2\sqrt{2}+ε<5.83$. This substantially improves on the current-best $(9+ε)$-approximation in [Ahmadian, Norouzi-Fard, Svensson, Ward - FOCS'17, SICOMP'20], and even slightly improves on the $5.92$-approximation in [Cohen-Addad, Esfandiari, Mirrokni, Narayanan - STOC'22] for the Euclidean special case. A natural approach for $k$-Means is to leverage Lagrangian Multiplier Preserving (LMP) approximations for the facility location problem. The previous best results for $k$-Means build upon an adaptation of the primal-dual LMP $3$-approximation for facility location with metric connection costs in [Jain, Vazirani - J.ACM'01], rather than on the improved greedy LMP $2$-approximation for the same problem in [Jain, Mahdian, Markakis, Saberi, Vazirani - J.ACM'03]. The barrier to using the improved LMP algorithm was that no adaptation of this algorithm and its analysis to squared metric connection costs was known, since squared distances violate the triangle inequality. Our main contribution is to overcome this barrier by providing such an adaptation. We then combine this new LMP approximation algorithm with the framework recently introduced in [Cohen-Addad, Grandoni, Lee, Schwiegelshohn, Svensson - STOC'25] for the related (metric) $k$-Median problem.
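To make the objective and the stated barrier concrete, the following is a minimal illustrative sketch and not the algorithm of the paper. It evaluates the $k$-Means cost from a distance matrix and checks, on a toy three-point line metric, that squared distances break the triangle inequality while still satisfying the relaxed form $d^2(x,z) \le 2\,(d^2(x,y) + d^2(y,z))$. The function name `kmeans_cost`, the toy metric, and the restriction of centers to input points are assumptions made only for illustration.

```python
import math


def kmeans_cost(dist, centers):
    """Sum, over all points, of the squared distance to the nearest chosen center.

    dist    : square matrix of pairwise metric distances (list of lists)
    centers : indices of the points chosen as centers (illustrative assumption:
              centers are picked among the input points)
    """
    return sum(min(dist[p][c] for c in centers) ** 2 for p in range(len(dist)))


# Toy metric (assumed for illustration): three points on a line at 0, 1, 2.
positions = [0.0, 1.0, 2.0]
dist = [[abs(a - b) for b in positions] for a in positions]

# Cost with the single middle center, and with the two end points as centers.
cost_middle = kmeans_cost(dist, [1])
cost_ends = kmeans_cost(dist, [0, 2])

# Squared distances are not a metric: d^2(0,2) = 4 exceeds d^2(0,1) + d^2(1,2) = 2.
sq = [[d * d for d in row] for row in dist]
violates_triangle = sq[0][2] > sq[0][1] + sq[1][2]

# They do satisfy the relaxed inequality with factor 2: 4 <= 2 * (1 + 1).
relaxed_holds = sq[0][2] <= 2 * (sq[0][1] + sq[1][2])

# The approximation ratio from the abstract, ignoring the epsilon term.
ratio = 3 + 2 * math.sqrt(2)
```

The toy example is tight for the factor-2 relaxation, which hints at why analyses written for metric connection costs do not carry over to squared distances unchanged.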