One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot \log k)$-approximate solution to the $(k,\alpha)$-means cost (where Euclidean distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-means cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $$ O_\alpha \left((g_\alpha)^{2/\alpha}\cdot \left(\frac{\sigma_{\mathrm{max}}}{\sigma_{\mathrm{min}}}\right)^{2-4/\alpha}\cdot (\min\{\ell,\log k\})^{2/\alpha}\right)$$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\mathrm{max}}$ and $\sigma_{\mathrm{min}}$ are the maximum and minimum standard deviations of the clusters around their means, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). We complement these results with lower bounds showing that the dependence on $g_\alpha$ and $\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}$ is tight. Finally, we provide an experimental confirmation of the effects of the aforementioned parameters when using $D^\alpha$ seeding. Further, we corroborate the observation that $\alpha>2$ can indeed improve the $k$-means cost compared to $D^2$ seeding, and that this advantage remains even if we run Lloyd's algorithm after the seeding.
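For readers unfamiliar with the seeding procedure, the following is a minimal sketch of $D^\alpha$ seeding as it is commonly described: the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to the distance to the nearest already-chosen center raised to the power $\alpha$. The function names (`sq_dist`, `d_alpha_seeding`, `kmeans_cost`), the tuple-based point representation, and the uniform fallback for degenerate inputs are assumptions made for illustration and are not taken from the paper. The cost function evaluates the standard $(k,2)$-means objective regardless of the $\alpha$ used for seeding, which is the setting studied above.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_alpha_seeding(points, k, alpha=2.0, rng=None):
    """Pick k centers from `points`; alpha=2 corresponds to k-means++ seeding."""
    rng = rng or random.Random()
    # First center: uniform at random.
    centers = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen center.
    nearest_sq = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sampling weight is distance**alpha = (squared distance)**(alpha / 2).
        weights = [d ** (alpha / 2.0) for d in nearest_sq]
        if sum(weights) == 0.0:
            # Degenerate case (assumption): every point coincides with a
            # center, so fall back to a uniform choice.
            new_center = rng.choice(points)
        else:
            new_center = rng.choices(points, weights=weights, k=1)[0]
        centers.append(new_center)
        nearest_sq = [
            min(d, sq_dist(p, new_center)) for p, d in zip(points, nearest_sq)
        ]
    return centers


def kmeans_cost(points, centers):
    """Standard k-means cost: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

A typical use would be to call `d_alpha_seeding(points, k, alpha=4.0)` and `d_alpha_seeding(points, k, alpha=2.0)` over many random seeds and compare the average `kmeans_cost` of the returned centers, optionally after running Lloyd's algorithm from each seeding.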