On Generalization Bounds for Projective Clustering

Given a set of points, clustering consists of finding a partition of the point set into $k$ clusters such that the center to which each point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular, for center-based objectives, we show a convergence rate of $\tilde{O}\left(\sqrt{{k}/{n}}\right)$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}\left(\sqrt{\frac{kj^2}{n}}\right)$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega\left(\sqrt{\frac{kj}{n}}\right)$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.
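To make the objectives above concrete, the following is a minimal sketch of how the cost of a candidate solution on a sample $P$ could be evaluated. It rests on assumptions not stated in the abstract: each center is an affine subspace described by a hypothetical `offset` point and an orthonormal `basis` (an empty basis, $j = 0$, gives a point center), and the cost is the sample average of the distance to the nearest center raised to a power `z` ($z = 2$ in the style of $k$-means, $z = 1$ in the style of $k$-median). The function names are illustrative only.

```python
from math import sqrt


def dist_to_flat(p, offset, basis):
    """Euclidean distance from point p to the affine subspace offset + span(basis).

    Assumption: `basis` is a list of orthonormal vectors. An empty list
    (j = 0) makes the center a single point, as in k-means or k-median.
    """
    v = [pi - oi for pi, oi in zip(p, offset)]
    sq = sum(x * x for x in v)
    # Subtract the squared length of the projection onto each basis vector.
    for b in basis:
        proj = sum(x * y for x, y in zip(v, b))
        sq -= proj * proj
    # Guard against tiny negative values caused by rounding.
    return sqrt(max(sq, 0.0))


def clustering_cost(points, centers, z=2):
    """Average over the sample of (distance to the nearest center) ** z."""
    total = 0.0
    for p in points:
        total += min(dist_to_flat(p, o, B) for o, B in centers) ** z
    return total / len(points)


sample = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

# Two point centers (j = 0).
point_centers = [((0.0, 0.0), []), ((0.0, 3.0), [])]
kmeans_style_cost = clustering_cost(sample, point_centers, z=2)
kmedian_style_cost = clustering_cost(sample, point_centers, z=1)

# One line center (j = 1): the x-axis.
line_centers = [((0.0, 0.0), [(1.0, 0.0)])]
subspace_cost = clustering_cost(sample, line_centers, z=2)
```

The learning question in the abstract then concerns how far the cost under $\mathcal{D}$ of a solution fitted on such a sample can be from the best achievable cost under $\mathcal{D}$, as a function of $n$, $k$, and $j$.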
