Clustering is one of the most fundamental tools in data science and machine learning, and k-means clustering is one of the most common clustering methods. A variety of approximate algorithms exist for the k-means problem, but computing the globally optimal solution is NP-hard in general. In this paper we consider the k-means problem for instances with low-dimensional data and formulate it as a structured concave assignment problem. This allows us to exploit the low-dimensional structure and solve the problem to global optimality within reasonable time for large data sets with several clusters. The method builds on iteratively solving a small concave problem and a large linear programming problem. This gives a sequence of feasible solutions along with bounds, and we show that the resulting optimality gap converges to zero. The paper combines methods from global optimization theory to accelerate the procedure, and we provide numerical results on their performance.
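The abstract does not spell out the concave formulation, so the following is only a sketch of one standard identity that makes such a formulation plausible, not necessarily the one used in the paper. For a fixed assignment, the k-means cost equals sum_i ||x_i||^2 - sum_j ||s_j||^2 / n_j, where s_j is the sum of the points in cluster j and n_j is its size. Both s_j and n_j are linear in the assignment variables, and ||s||^2 / n is convex for n > 0, so the cost is concave in the assignment and depends on it only through d + 1 numbers per cluster. The function names and the random test data below are illustrative assumptions.

```python
import random


def cluster_stats(points, labels, k):
    """Return per-cluster coordinate sums s_j and sizes n_j."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for t in range(dim):
            sums[c][t] += p[t]
    return sums, counts


def kmeans_cost_centroids(points, labels, k):
    """Usual k-means cost: squared distance of each point to its cluster mean."""
    sums, counts = cluster_stats(points, labels, k)
    # Empty clusters get no mean; no point refers to them.
    means = [
        [v / counts[c] for v in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((p[t] - means[c][t]) ** 2 for t in range(len(p)))
        for p, c in zip(points, labels)
    )


def kmeans_cost_assignment(points, labels, k):
    """Same cost from cluster sums and sizes only:
    sum_i ||x_i||^2 - sum_j ||s_j||^2 / n_j (empty clusters contribute 0)."""
    sums, counts = cluster_stats(points, labels, k)
    total_sq = sum(sum(v * v for v in p) for p in points)
    return total_sq - sum(
        sum(v * v for v in sums[c]) / counts[c]
        for c in range(k)
        if counts[c]
    )


# Illustrative data: 50 random 2-D points with an arbitrary 3-cluster labelling.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
labels = [random.randrange(3) for _ in points]

cost_a = kmeans_cost_centroids(points, labels, 3)
cost_b = kmeans_cost_assignment(points, labels, 3)
print(abs(cost_a - cost_b) < 1e-9)
```

The second form never touches the centroids, which is why the low dimension of the data matters: the nonlinear part of the cost lives in a space of size k(d + 1) regardless of how many points there are.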