An Analysis of $D^\alpha$ seeding for $k$-means

One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that, for any $\alpha\ge 1$, it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$). More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $$ O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot (\min\{\ell,\log k\})^{2/\alpha}\right)$$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
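For concreteness, the following is a minimal sketch of $D^\alpha$ seeding as it is commonly described: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to the distance to the nearest already-chosen center raised to the power $\alpha$. It is an illustration under these assumptions rather than the implementation used in the experiments. The names `d_alpha_seeding`, `k_means_cost`, and `sq_dist` are hypothetical helpers introduced here.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_alpha_seeding(points, k, alpha=2.0, seed=None):
    """Pick k centers from `points` by D^alpha sampling (alpha=2 is k-means++)."""
    rng = random.Random(seed)
    # First center: uniform at random over the input points.
    centers = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen center.
    nearest_sq = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # dist^alpha == (dist^2)^(alpha/2); this is the sampling weight.
        weights = [d ** (alpha / 2.0) for d in nearest_sq]
        if sum(weights) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            new_center = rng.choice(points)
        else:
            new_center = rng.choices(points, weights=weights, k=1)[0]
        centers.append(new_center)
        nearest_sq = [
            min(d, sq_dist(p, new_center)) for d, p in zip(nearest_sq, points)
        ]
    return centers


def k_means_cost(points, centers):
    """Standard (k,2)-means cost: sum of squared distances to nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Seeding with, say, `alpha=4` and then evaluating `k_means_cost` on the result mirrors the setting studied here: the centers are sampled with $\alpha>2$, but the solution is scored by the standard $k$-means objective.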
