Subset selection for the rank-$k$ approximation of an $n\times d$ matrix $A$ offers improvements in the interpretability of matrices, as well as a variety of computational savings. This problem is well understood when the error measure is the Frobenius norm, with various tight algorithms known even in challenging models such as the online model, where an algorithm must select the column subset irrevocably as the columns arrive one by one. In contrast, for other matrix losses, optimal trade-offs between the subset size and approximation quality have not been settled, even in the offline setting. We give a number of results towards closing these gaps. In the offline setting, we achieve nearly optimal bicriteria algorithms in two settings. First, we remove a $\sqrt k$ factor from a result of [SWZ19] when the loss function is any entrywise loss with an approximate triangle inequality and at least linear growth. Our result is tight for the $\ell_1$ loss. We give a similar improvement for entrywise $\ell_p$ losses for $p>2$, improving a previous distortion of $k^{1-1/p}$ to $k^{1/2-1/p}$. Our results come from a technique that replaces the use of a well-conditioned basis with a slightly larger spanning set for which any vector can be expressed as a linear combination with small Euclidean norm. We show that this technique also gives the first oblivious $\ell_p$ subspace embeddings for $1<p<2$ with $\tilde O(d^{1/p})$ distortion, which is nearly optimal and closes a long line of work. In the online setting, we give the first online subset selection algorithm for $\ell_p$ subspace approximation and entrywise $\ell_p$ low-rank approximation by implementing sensitivity sampling online, which is challenging due to the sequential nature of sensitivity sampling. Our main technique is an online algorithm for detecting when an approximately optimal subspace changes substantially.
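The abstract does not spell out the bicriteria guarantee formally. One common way to write it is sketched below; the symbols $S$, $s$, $A|^S$, $X$, $\kappa$, and $\|\cdot\|_g$ are notation assumed here for illustration and are not taken from the text. For an entrywise loss $g$, write $\|M\|_g = \sum_{i,j} g(M_{ij})$. A bicriteria column subset selection algorithm outputs a set $S \subseteq [d]$ of $s$ columns, where $s$ may exceed $k$, such that

$$
\min_{X \in \mathbb{R}^{s \times d}} \bigl\| A - A|^S X \bigr\|_g \;\le\; \kappa \cdot \min_{\operatorname{rank}(\hat A) \le k} \bigl\| A - \hat A \bigr\|_g ,
$$

where $A|^S$ denotes the $n \times s$ submatrix of $A$ formed by the selected columns and $\kappa \ge 1$ is the distortion. Under this reading, the entrywise $\ell_p$ case would measure error in the norm $\|M\|_p = \bigl(\sum_{i,j} |M_{ij}|^p\bigr)^{1/p}$, and the stated improvement for $p>2$ would correspond to lowering $\kappa$ from $k^{1-1/p}$ to $k^{1/2-1/p}$, presumably up to lower-order factors.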