Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that each point is as close as possible to the center to which it is assigned. Most commonly, the centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose the centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set $P$ of $n$ samples drawn independently from some unknown but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular, for center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.
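To make the convergence question concrete, one common way to formalize it is sketched here. The notation $\mathrm{cost}$, $\mathcal{S}$, $\mathcal{S}_P$, and $z$ is ours rather than the abstract's, and we assume for illustration that $\mathcal{D}$ is supported on the unit Euclidean ball. For a candidate solution $\mathcal{S}$ of $k$ centers and a power $z \geq 1$ (with $z = 1$ for $k$-median and $z = 2$ for $k$-means), define the expected and empirical costs

$$\mathrm{cost}_{\mathcal{D}}(\mathcal{S}) = \mathbb{E}_{p \sim \mathcal{D}}\left[\min_{c \in \mathcal{S}} \mathrm{dist}(p, c)^z\right], \qquad \mathrm{cost}_{P}(\mathcal{S}) = \frac{1}{n} \sum_{p \in P} \min_{c \in \mathcal{S}} \mathrm{dist}(p, c)^z,$$

where $\mathrm{dist}(p, c)$ is the Euclidean distance from $p$ to the center $c$ (a point, or a $j$-dimensional subspace in the subspace clustering setting). If $\mathcal{S}_P$ denotes a minimizer of $\mathrm{cost}_{P}$, the quantity of interest is the excess risk

$$\mathrm{cost}_{\mathcal{D}}(\mathcal{S}_P) - \min_{\mathcal{S}} \mathrm{cost}_{\mathcal{D}}(\mathcal{S}).$$

Under this reading, a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$ is an upper bound on the excess risk as a function of the sample size $n$, and a rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ says that no bound of smaller order can hold for all distributions.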