We introduce the first iterative algorithm for constructing an $\varepsilon$-coreset that guarantees a deterministic $\ell_p$ subspace embedding for any $p \in [1,\infty)$ and any $\varepsilon > 0$. Given a full-rank matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ with $n \gg d$, a matrix $\mathbf{X}' \in \mathbb{R}^{m \times d}$ is an $(\varepsilon,\ell_p)$-subspace embedding of $\mathbf{X}$ if, for every $\mathbf{q} \in \mathbb{R}^d$, $(1-\varepsilon)\|\mathbf{Xq}\|_{p}^{p} \leq \|\mathbf{X'q}\|_{p}^{p} \leq (1+\varepsilon)\|\mathbf{Xq}\|_{p}^{p}$. In this paper, $\mathbf{X}'$ is a weighted subset of the rows of $\mathbf{X}$, which is commonly known in the literature as a coreset. In every iteration, the algorithm ensures that the loss on the maintained set is bounded above and below by appropriately scaled versions of the loss on the original dataset. Because the loss is bounded in this way, and unlike typical coreset guarantees, our coreset gives a deterministic guarantee for the $\ell_p$ subspace embedding. For an error parameter $\varepsilon$, our algorithm runs in $O(\mathrm{poly}(n,d,\varepsilon^{-1}))$ time and returns a deterministic $\varepsilon$-coreset for $\ell_p$ subspace embedding of size $O\left(\frac{d^{\max\{1,p/2\}}}{\varepsilon^{2}}\right)$. This removes the $\log$ factors in the coreset size, which had been a long-standing open problem. The coreset size is optimal, as it matches the lower bound. As an application, our coreset can also be used to approximately solve the $\ell_p$ regression problem in a deterministic manner.
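To make the embedding definition concrete, the following minimal sketch measures how far a candidate weighted row subset deviates from the $(\varepsilon,\ell_p)$ condition on randomly drawn query vectors $\mathbf{q}$. It is not the paper's construction algorithm, and because it samples only finitely many directions it can refute a claimed $\varepsilon$ but cannot certify the "for every $\mathbf{q}$" guarantee. The names `lp_cost` and `max_distortion`, the Gaussian sampling of $\mathbf{q}$, and the convention that a row with weight $w_i$ contributes $w_i|\mathbf{x}_i^\top\mathbf{q}|^p$ are assumptions made for illustration.

```python
import numpy as np

def lp_cost(X, q, p, w=None):
    """Weighted l_p loss: sum_i w_i * |x_i^T q|^p (unit weights if w is None)."""
    r = np.abs(X @ q) ** p
    return float(r.sum() if w is None else (w * r).sum())

def max_distortion(X, idx, w, p, trials=1000, seed=0):
    """Largest relative error | ||X'q||_p^p - ||Xq||_p^p | / ||Xq||_p^p
    observed over `trials` random Gaussian query vectors q.

    X'   is the subset of rows X[idx] carrying weights w (hypothetical input).
    This is an empirical check only, not a proof over all q.
    """
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        q = rng.standard_normal(X.shape[1])
        full = lp_cost(X, q, p)
        sub = lp_cost(X[idx], q, p, w)
        worst = max(worst, abs(sub - full) / full)
    return worst

# Toy usage: every row of X appears twice, so keeping one copy of each
# row with weight 2 reproduces the loss exactly (up to rounding).
A = np.random.default_rng(1).standard_normal((50, 3))
X = np.vstack([A, A])
idx = np.arange(len(A))
w = np.full(len(A), 2.0)
toy_distortion = max_distortion(X, idx, w, p=3)
```

If the observed distortion exceeds a target $\varepsilon$ for some sampled $\mathbf{q}$, the candidate subset is not an $(\varepsilon,\ell_p)$-subspace embedding; a small observed value is only evidence, not a guarantee.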