In large-scale machine learning, random sampling is a popular way to approximate a dataset by a small representative subset of examples. In particular, sensitivity sampling is an intensely studied technique that provides provable guarantees on the quality of approximation while reducing the number of examples to the product of the VC dimension $d$ and the total sensitivity $\mathfrak S$ in remarkably general settings. However, despite this intense study, guarantees that go beyond the general $\mathfrak S d$ bound are known in perhaps only one setting: $\ell_2$ subspace embeddings. In this work, we show the first bounds for sensitivity sampling for $\ell_p$ subspace embeddings with $p > 2$ that improve over the general $\mathfrak S d$ bound, achieving a bound of roughly $\mathfrak S^{2-2/p}$ for $2<p<\infty$. Our techniques also yield new results for other sampling algorithms: root leverage score sampling achieves a bound of roughly $d$ for $1\leq p<2$, and a combination of leverage score and sensitivity sampling achieves an improved bound of roughly $d^{2/p}\mathfrak S^{2-4/p}$ for $2<p<\infty$. Our sensitivity sampling results yield the best known sample complexity for a wide class of structured matrices that have small $\ell_p$ sensitivity.
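To make the sampling scheme concrete, here is a minimal sketch of sensitivity-style row sampling in the one setting the abstract singles out, $\ell_2$ subspace embeddings, where the sensitivities coincide with the leverage scores. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the function names `leverage_scores` and `sensitivity_sample`, the `oversample` parameter, and the choice of independent sampling with probabilities $q_i = \min(1, \alpha\,\sigma_i)$ and rescaling by $q_i^{-1/p}$ are all assumptions made for this example, and the input matrix is assumed to have full column rank.

```python
import numpy as np


def leverage_scores(A):
    """l_2 leverage scores of the rows of A (assumes full column rank).

    With a thin QR factorization A = QR, the score of row i is the
    squared Euclidean norm of row i of Q. The scores sum to the number
    of columns d.
    """
    Q, _ = np.linalg.qr(A)  # reduced mode: Q has shape (n, d)
    return np.sum(Q ** 2, axis=1)


def sensitivity_sample(A, scores, p, oversample, rng):
    """Keep row i independently with probability q_i = min(1, oversample * scores[i]).

    Each kept row is rescaled by q_i^(-1/p), so that ||SAx||_p^p is an
    unbiased estimate of ||Ax||_p^p for every fixed x. `oversample` is a
    hypothetical knob; the abstract does not specify how it should be set.
    """
    q = np.minimum(1.0, oversample * scores)
    keep = rng.random(A.shape[0]) < q
    weights = 1.0 / q[keep] ** (1.0 / p)
    return weights[:, None] * A[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2000, 5))
    scores = leverage_scores(A)  # for p = 2 these are the sensitivities
    SA = sensitivity_sample(A, scores, p=2, oversample=50.0, rng=rng)

    # Compare ||SAx||_2 with ||Ax||_2 for one arbitrary direction x.
    x = rng.standard_normal(5)
    print("rows kept:", SA.shape[0], "of", A.shape[0])
    print("norm ratio:", np.linalg.norm(SA @ x) / np.linalg.norm(A @ x))
```

The expected number of rows kept is at most `oversample` times the sum of the scores, which is why the total sensitivity $\mathfrak S$ governs the sample complexity. For $p \neq 2$ the scores passed in would have to be $\ell_p$ sensitivities (or upper bounds on them), which this sketch does not compute.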