A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression, where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$, independent of $m$ (i.e., dimension-free), that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two applications of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tilde\Theta(\varepsilon^{-2})$ samples for $p = 1$, $\tilde\Theta(\varepsilon^{-1})$ samples for $1 < p < 2$, and $\tilde\Theta(\varepsilon^{1-p})$ samples for $p>2$ are necessary and sufficient, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows that spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.
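The "every point in the domain" guarantee can be written out as follows. This is a sketch under notation the abstract does not fix, so each symbol is an assumption: $A\in\mathbb{R}^{n\times d}$ is the design matrix, $B\in\mathbb{R}^{n\times m}$ stacks the $m$ responses, the objective is taken to be the entrywise norm $\|M\|_{p,p}^p=\sum_{i=1}^{n}\sum_{j=1}^{m}|M_{ij}|^p$, and $S$ is a diagonal matrix whose nonzero entries select and weight the coreset rows.

$$
\|S(AX-B)\|_{p,p}^p \;=\; (1\pm\varepsilon)\,\|AX-B\|_{p,p}^p
\qquad \text{for all } X\in\mathbb{R}^{d\times m}.
$$

Under this reading, the coreset size is the number of nonzero diagonal entries of $S$, and the stated bounds say this number need not grow with $m$.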