We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression, giving a much smaller sketching dimension that achieves an $O(1)$-approximation and yields an efficient optimization problem in the sketch space. Namely, for any constant $c>0$, we achieve a sketching dimension of $\tilde{O}(d^{1+c})$ for $\ell_1$ regression and $\tilde{O}(\mu d^{1+c})$ for logistic regression, where $\mu$ is a standard measure that captures the complexity of compressing the data. For $\ell_1$ regression, our sketching dimension is near-linear and improves on previous work, which either required an $\Omega(\log d)$-approximation with this sketching dimension or required a larger $\operatorname{poly}(d)$ number of rows. Similarly, for logistic regression, previous work had worse $\operatorname{poly}(\mu d)$ factors in its sketching dimension. We also give a tradeoff that yields a $(1+\varepsilon)$-approximation in input sparsity time by increasing the total size to $(d\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for $\ell_1$ regression and to $(\mu d\log(n)/\varepsilon)^{O(1/\varepsilon)}$ for logistic regression. Finally, we show that our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.
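To illustrate why an oblivious linear sketch fits the turnstile streaming model, the following minimal Python sketch maintains $SA$ under a stream of additive updates to the entries of $A$. It uses a plain CountSketch map purely as a stand-in: this is an assumption made for illustration and not the sketch constructed in this work, and all function names (`make_countsketch`, `update`, `sketch_matrix`) are hypothetical.

```python
import random

def make_countsketch(n, m, seed=0):
    """Draw an oblivious map: each of the n input rows gets a bucket in
    range(m) and a random sign, chosen without looking at the data."""
    rng = random.Random(seed)
    buckets = [rng.randrange(m) for _ in range(n)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return buckets, signs

def empty_sketch(m, d):
    """An all-zero m x d sketch, i.e. the sketch of the zero matrix."""
    return [[0.0] * d for _ in range(m)]

def update(sketch, buckets, signs, i, j, delta):
    """Apply the turnstile update A[i][j] += delta directly to S*A."""
    sketch[buckets[i]][j] += signs[i] * delta

def sketch_matrix(A, buckets, signs, m):
    """Compute S*A offline, for comparison with the streamed sketch."""
    SA = empty_sketch(m, len(A[0]))
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            SA[buckets[i]][j] += signs[i] * value
    return SA

# Illustrative sizes only (assumed): n = 8 rows, d = 2 columns, m = 4 sketch rows.
n, d, m = 8, 2, 4
buckets, signs = make_countsketch(n, m, seed=1)
A = [[float(i + j) for j in range(d)] for i in range(n)]

# Stream every entry of A, then an insertion that is later deleted again.
stream = [(i, j, A[i][j]) for i in range(n) for j in range(d)]
stream += [(3, 1, 5.0), (3, 1, -5.0)]

streamed = empty_sketch(m, d)
for i, j, delta in stream:
    update(streamed, buckets, signs, i, j, delta)

offline = sketch_matrix(A, buckets, signs, m)
```

Because the map is linear and drawn independently of the data, the streamed sketch agrees with the offline one regardless of the order of updates or of intermediate deletions. The approximation guarantees stated above depend on the specific sketch and sketching dimension, which this toy example does not attempt to reproduce.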