We provide space complexity lower bounds for data structures that approximate logistic loss up to $\epsilon$-relative error on a logistic regression problem with data $\mathbf{X} \in \mathbb{R}^{n \times d}$ and labels $\mathbf{y} \in \{-1,1\}^n$. The space complexity of existing coreset constructions depends on a natural complexity measure $\mu_\mathbf{y}(\mathbf{X})$, first defined in (Munteanu, 2018). We give an $\tilde{\Omega}(\frac{d}{\epsilon^2})$ space complexity lower bound in the regime $\mu_\mathbf{y}(\mathbf{X}) = O(1)$, which shows that existing coresets are optimal in this regime up to lower-order factors. We also prove a general $\tilde{\Omega}(d\cdot \mu_\mathbf{y}(\mathbf{X}))$ space lower bound when $\epsilon$ is constant, showing that the dependence on $\mu_\mathbf{y}(\mathbf{X})$ is not an artifact of mergeable coresets. Finally, we refute a prior conjecture that $\mu_\mathbf{y}(\mathbf{X})$ is hard to compute by providing an efficient linear programming formulation, and we empirically compare our algorithm to prior approximate methods.
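For concreteness, the quantities named above can be written out as follows. This is a sketch under the standard definitions, which the abstract itself does not state: $\mathbf{x}_i$ (the $i$-th row of $\mathbf{X}$), $\tilde{\mathcal{L}}$ (the value returned by the data structure), and $\mathbf{D}_\mathbf{y} = \mathrm{diag}(\mathbf{y})$ are notation introduced here for illustration, and the form of $\mu_\mathbf{y}(\mathbf{X})$ is the one commonly attributed to (Munteanu, 2018), assumed rather than quoted.

```latex
% Logistic loss of a parameter vector \beta \in \mathbb{R}^d
\mathcal{L}(\beta) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \mathbf{x}_i^\top \beta}\right)

% \epsilon-relative error guarantee for an approximation \tilde{\mathcal{L}}
\left|\tilde{\mathcal{L}}(\beta) - \mathcal{L}(\beta)\right| \le \epsilon \, \mathcal{L}(\beta)
\quad \text{for all } \beta \in \mathbb{R}^d

% Complexity measure (assumed form), where (\cdot)^+ and (\cdot)^- keep the
% positive and negative entries of a vector, respectively
\mu_\mathbf{y}(\mathbf{X}) = \sup_{\beta \neq 0}
\frac{\left\| (\mathbf{D}_\mathbf{y} \mathbf{X} \beta)^{+} \right\|_1}
     {\left\| (\mathbf{D}_\mathbf{y} \mathbf{X} \beta)^{-} \right\|_1}
```

Under this reading, $\mu_\mathbf{y}(\mathbf{X})$ is small when no direction $\beta$ classifies far more of the total margin correctly than incorrectly, which is the regime in which the $\tilde{\Omega}(\frac{d}{\epsilon^2})$ bound applies.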