We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under assumptions on the optimal solutions to $k$-means, which makes those assumptions difficult to validate in practice. We propose the manifold hypothesis, under which data observed in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption for modeling real-world clustering instances. Using techniques from optimal quantization theory, we identify key geometric properties of datasets whose scaling laws are theoretically predictable and depend on the quantization exponent $\varepsilon = 2/d$. We show how to exploit these regularities to design a fast seeding method called $\operatorname{Qkmeans}$, which provides $O(\rho^{-2} \log k)$-approximate solutions to the $k$-means problem in time $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This yields new runtime-quality tradeoffs. We perform a large-scale empirical study across various domains to validate our theoretical predictions and the algorithm's performance, bridging theory and practice for beyond-worst-case data clustering.
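To make the exponents above concrete, the following is a minimal sketch, not the authors' implementation. It computes $\varepsilon = 2/d$ and the runtime exponent $1 + \gamma$ from the definitions in the abstract, and shows how a quantization exponent could be read off a log-log fit of clustering cost against $k$. The function names are hypothetical, the range $0 < \rho < 1$ is an assumption (the abstract states only $\rho < 1$), and the power-law form $\text{cost} \propto k^{-\varepsilon}$ is an assumption borrowed from classical quantization theory; the abstract does not state which dataset properties follow this law. The costs used here are synthetic, not measured.

```python
import numpy as np


def quantization_exponent(d: float) -> float:
    """epsilon = 2/d for intrinsic dimension d, as defined in the abstract."""
    if d <= 0:
        raise ValueError("intrinsic dimension d must be positive")
    return 2.0 / d


def runtime_exponent(d: float, rho: float) -> float:
    """Exponent 1 + gamma of k in the stated runtime, with gamma = epsilon + rho."""
    # Assumption: rho lies in (0, 1); the abstract only states rho < 1.
    if not 0.0 < rho < 1.0:
        raise ValueError("rho is assumed to lie in (0, 1)")
    return 1.0 + quantization_exponent(d) + rho


def fit_cost_exponent(ks, costs) -> float:
    """Negated least-squares slope of log(cost) against log(k).

    Assumption: cost behaves like C * k**(-epsilon) over the sampled range.
    """
    slope, _intercept = np.polyfit(np.log(ks), np.log(costs), deg=1)
    return float(-slope)


# Hypothetical illustration with synthetic costs that follow an exact power law.
ks = np.array([16, 32, 64, 128, 256], dtype=float)
d_true = 8
synthetic_costs = 5.0 * ks ** (-quantization_exponent(d_true))

eps_hat = fit_cost_exponent(ks, synthetic_costs)  # recovered exponent
d_hat = 2.0 / eps_hat                             # implied intrinsic dimension
```

For example, with $d = 8$ and $\rho = 0.25$ the definitions give $\varepsilon = 0.25$ and $\gamma = 0.5$, so the $k$-dependent term of the runtime grows as $k^{1.5}$ up to logarithmic factors. On real data the costs would come from an actual clustering run, and the fit would be noisy rather than exact.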