Logistic regression is one of the most common machine learning setups. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, which give the true conditional probability of the label (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $\sigma(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e., the number of samples $n$ exceeds the input dimension $d$), the soft-labeled examples $(\mathbf{x}_i,\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$ suffice to recover $\mathbf{w}^{\star}$ and hence to achieve the Bayes risk. However, we prove that when the examples instead carry hard labels $y_i$ sampled from the same conditional distribution $\sigma(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, rotation-invariant algorithms are suboptimal: they incur an excess risk of $\Omega\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation-invariant algorithms with excess risk $O\!\left(\frac{s\log d}{n}\right)$. The simplest rotation-invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bound runs gradient descent on the weights $u_i,v_i$, where each linear weight $w_i$ is reparameterized as the product $u_iv_i$.
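The contrast described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under assumptions that do not come from the text: Gaussian inputs, a target whose first $s$ coordinates equal 1, a fixed number of steps standing in for early stopping, and the initialization $u_i = 0.1$, $v_i = 0$. All function names and hyperparameters are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_loss(w, X, y):
    """Mean logistic loss log(1 + exp(-y * x^T w)) for labels y in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))


def logistic_grad(w, X, y):
    """Gradient of logistic_loss with respect to the linear weights w."""
    margins = y * (X @ w)
    return -(X.T @ (y * sigmoid(-margins))) / len(y)


def recover_from_soft_labels(X, p):
    """Invert the logistic link, then solve the linear system X w = logit(p)."""
    logits = np.log(p) - np.log1p(-p)
    w, *_ = np.linalg.lstsq(X, logits, rcond=None)
    return w


def gd_linear(X, y, lr=0.5, steps=300):
    """Plain gradient descent on w (rotation-invariant); fixed step budget."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * logistic_grad(w, X, y)
    return w


def gd_reparameterized(X, y, lr=0.1, steps=300, init=0.1):
    """Gradient descent on (u, v) with w = u * v taken coordinate-wise.

    Assumed initialization: u = init, v = 0, so w starts at zero but
    each coordinate can still move in either direction.
    """
    d = X.shape[1]
    u = np.full(d, init)
    v = np.zeros(d)
    for _ in range(steps):
        g = logistic_grad(u * v, X, y)          # gradient with respect to w
        u, v = u - lr * g * v, v - lr * g * u   # chain rule, simultaneous update
    return u * v


def demo(seed=0, n=200, d=50, s=3):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))             # assumed input distribution
    w_star = np.zeros(d)
    w_star[:s] = 1.0                            # assumed s-sparse target
    p = sigmoid(X @ w_star)                     # noise-free soft labels
    y = np.where(rng.random(n) < p, 1.0, -1.0)  # hard labels sampled from p
    w_gd = gd_linear(X, y)
    w_uv = gd_reparameterized(X, y)
    return {
        "w_star": w_star,
        "w_soft": recover_from_soft_labels(X, p),
        "w_gd": w_gd,
        "w_uv": w_uv,
        "loss_gd": logistic_loss(w_gd, X, y),
        "loss_uv": logistic_loss(w_uv, X, y),
    }


results = demo()
for name in ("w_soft", "w_gd", "w_uv"):
    err = np.linalg.norm(results[name] - results["w_star"])
    print(f"{name}: parameter error {err:.4f}")
```

The printout reports parameter error on a single random draw, whereas the bounds in the text concern excess risk, so it should be read as an illustration of the two parameterizations rather than as evidence for the rates.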