In large-scale machine learning, random sampling is a popular way to approximate a dataset by a small representative subset of examples. In particular, sensitivity sampling is an intensely studied technique that provides provable guarantees on the quality of approximation while reducing the number of examples to the product of the VC dimension $d$ and the total sensitivity $\mathfrak S$ in remarkably general settings. However, despite this intense study, guarantees going beyond the general $\mathfrak S d$ bound are known in perhaps only one setting, that of $\ell_2$ subspace embeddings. In this work, we show the first bounds for sensitivity sampling for $\ell_p$ subspace embeddings with $p\neq 2$ that improve over the general $\mathfrak S d$ bound, achieving a bound of roughly $\mathfrak S^{2/p}$ for $1\leq p<2$ and $\mathfrak S^{2-2/p}$ for $2<p<\infty$. For $1\leq p<2$, we show that this bound is tight, in the sense that there exist matrices for which $\mathfrak S^{2/p}$ samples are necessary. Our techniques also yield new results in the study of sampling algorithms, showing that the root leverage score sampling algorithm achieves a bound of roughly $d$ for $1\leq p<2$, and that a combination of leverage score and sensitivity sampling achieves an improved bound of roughly $d^{2/p}\mathfrak S^{2-4/p}$ for $2<p<\infty$. Our sensitivity sampling results yield the best known sample complexity for a wide class of structured matrices that have small $\ell_p$ sensitivity.
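To make the sampling scheme concrete, the following is a minimal sketch of sensitivity sampling in the one setting the abstract singles out as well understood, $\ell_2$ subspace embeddings, where the sensitivities coincide with the leverage scores and the total sensitivity equals the rank. It is an illustration under stated assumptions, not the algorithm analyzed in this work for $p\neq 2$: the function names `leverage_scores` and `sensitivity_sample` and the `oversample` parameter are hypothetical, and the input matrix is assumed to have full column rank.

```python
import numpy as np


def leverage_scores(A):
    """l2 leverage scores of the rows of A (assumes full column rank).

    For p = 2 these equal the l2 sensitivities, and they sum to rank(A).
    """
    Q, _ = np.linalg.qr(A)  # reduced QR: Q has orthonormal columns
    return np.sum(Q**2, axis=1)


def sensitivity_sample(A, scores, p, oversample, rng):
    """Keep row i independently with probability q_i = min(1, oversample * scores[i]).

    Each kept row is rescaled by q_i^(-1/p), so that ||SAx||_p^p is an
    unbiased estimate of ||Ax||_p^p for every fixed x.
    `oversample` is a hypothetical tuning knob, not a constant from the paper.
    """
    probs = np.minimum(1.0, oversample * scores)
    keep = rng.random(A.shape[0]) < probs
    weights = probs[keep] ** (-1.0 / p)
    return weights[:, None] * A[keep]


rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))
tau = leverage_scores(A)

# Sample rows for the l2 case and compare norms on a random direction x.
SA = sensitivity_sample(A, tau, p=2, oversample=40.0, rng=rng)
x = rng.standard_normal(5)
ratio = np.linalg.norm(SA @ x) / np.linalg.norm(A @ x)
print(SA.shape[0], "rows kept; norm ratio:", ratio)
```

The expected number of kept rows is the sum of the sampling probabilities, which is at most `oversample` times the total sensitivity. This is the quantity the sample complexity bounds above control.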