The $\ell_p$ subspace approximation problem is an NP-hard low rank approximation problem that generalizes the median hyperplane ($p = 1$), principal component analysis ($p = 2$), and center hyperplane problems ($p = \infty$). A popular approach to cope with the NP-hardness is to compute a strong coreset, which is a weighted subset of input points that simultaneously approximates the cost of every $k$-dimensional subspace, typically to $(1+\epsilon)$ relative error for a small constant $\epsilon$. We obtain an algorithm for constructing a strong coreset for $\ell_p$ subspace approximation of size $\tilde O(k\epsilon^{-4/p})$ for $p<2$ and $\tilde O(k^{p/2}\epsilon^{-p})$ for $p>2$. This offers the following improvements over prior work:
- We construct the first strong coresets with nearly optimal dependence on $k$ for all $p\neq 2$. In prior work, [SW18] constructed coresets of modified points with a similar dependence on $k$, while [HV20] constructed true coresets with polynomially worse dependence on $k$.
- We recover or improve the best known $\epsilon$ dependence for all $p$. In particular, for $p > 2$, the [SW18] coreset of modified points had a dependence of $\epsilon^{-p^2/2}$ and the [HV20] coreset had a dependence of $\epsilon^{-3p}$.
Our algorithm is based on sampling by root ridge leverage scores, which admits fast algorithms, especially for sparse or structured matrices. Our analysis avoids the use of the representative subspace theorem [SW18], which is a critical component of all prior dimension-independent coresets for $\ell_p$ subspace approximation. Our techniques also lead to the first nearly optimal online strong coresets for $\ell_p$ subspace approximation with similar bounds as the offline setting, resolving a problem of [WY23]. All prior approaches lose $\mathrm{poly}(k)$ factors in this setting, even when allowed to modify the original points.
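To make the objects in the abstract concrete, the following is a minimal Python sketch of the $\ell_p$ subspace approximation cost and of sampling rows by the square roots of ridge leverage scores. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`subspace_cost`, `ridge_leverage_scores`, `sample_coreset`) and the sample budget `m` are hypothetical, the ridge parameter $\lambda = \|A - A_k\|_F^2 / k$ is the standard choice for rank-$k$ ridge leverage scores, and reading "root ridge leverage scores" as the plain square root of those scores is an assumption. The sketch makes no attempt to reproduce the coreset sizes stated above.

```python
import numpy as np

def subspace_cost(A, V, p, weights=None):
    """Weighted sum of p-th powers of Euclidean distances from the rows
    of A to the subspace spanned by the orthonormal columns of V."""
    residual = A - (A @ V) @ V.T          # component orthogonal to span(V)
    dists = np.linalg.norm(residual, axis=1)
    if weights is None:
        weights = np.ones(A.shape[0])
    return float(np.sum(weights * dists ** p))

def ridge_leverage_scores(A, k):
    """Rank-k ridge leverage scores a_i^T (A^T A + lam I)^+ a_i,
    with lam = ||A - A_k||_F^2 / k (standard choice, assumed here)."""
    s = np.linalg.svd(A, compute_uv=False)
    lam = np.sum(s[k:] ** 2) / k
    G = A.T @ A + lam * np.eye(A.shape[1])
    scores = np.einsum("ij,ij->i", A @ np.linalg.pinv(G), A)
    return np.maximum(scores, 0.0)        # guard against rounding below zero

def sample_coreset(A, k, m, rng):
    """Keep row i independently with probability min(1, m * s_i / sum(s)),
    where s_i is the square root of its ridge leverage score (an assumed
    reading of 'root ridge leverage scores'); weight kept rows by 1/prob
    so the weighted cost is an unbiased estimate of the full cost."""
    scores = np.sqrt(ridge_leverage_scores(A, k))
    probs = np.minimum(1.0, m * scores / scores.sum())
    keep = rng.random(A.shape[0]) < probs
    return np.flatnonzero(keep), 1.0 / probs[keep]
```

A weighted sample `(idx, w)` returned by `sample_coreset` can be compared against the full data by evaluating `subspace_cost(A[idx], V, p, weights=w)` and `subspace_cost(A, V, p)` for the same candidate subspace `V`. A strong coreset requires these two values to agree up to a $(1+\epsilon)$ factor for every $k$-dimensional subspace simultaneously, which this sketch does not verify.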