For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problem consists of finding $k$ centers such that the sum of squared distances from each data point to its closest center is minimized. Coresets are one of the main tools developed recently to solve this problem in a big data context. They allow one to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data. In this work, we study coresets in a fully dynamic setting: points are added and deleted, and the goal is to efficiently maintain a coreset from which a $k$-means solution can be computed. Based on an algorithm by Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% on the quality of the $k$-means solution.
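To make the objective and the coreset guarantee concrete, the following is a minimal sketch in plain Python. It only evaluates the $k$-means cost on a point set and on a weighted subset; it does not implement the dynamic algorithm of Henzinger and Kale. The function names `kmeans_cost` and `coreset_relative_error`, as well as the toy data, are illustrative assumptions and do not come from the paper.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points: Sequence[Point],
                centers: Sequence[Point],
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of squared distances from each point to its closest center.

    With weights=None every point has weight 1, which gives the usual
    k-means objective on the full dataset.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def coreset_relative_error(points: Sequence[Point],
                           coreset: Sequence[Point],
                           coreset_weights: Sequence[float],
                           centers: Sequence[Point]) -> float:
    """Relative gap between the coreset cost and the full cost for one
    candidate set of centers (assumes the full cost is non-zero).

    A coreset keeps this gap small for every choice of centers; this
    helper checks only the single choice passed in.
    """
    full = kmeans_cost(points, centers)
    approx = kmeans_cost(coreset, centers, coreset_weights)
    return abs(approx - full) / full


# Toy data (hypothetical): two pairs of points, each pair summarized by
# one representative carrying weight 2.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
coreset = [(0.0, 0.0), (10.0, 2.0)]
coreset_weights = [2.0, 2.0]
centers = [(0.0, 1.0), (10.0, 1.0)]

full_cost = kmeans_cost(points, centers)
error = coreset_relative_error(points, coreset, coreset_weights, centers)
```

Checking a single set of centers, as done here, is only a spot check: the coreset property is a statement about all possible center sets simultaneously.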