The logistic regression estimator is known to inflate the magnitude of its coefficients when the sample size $n$ is small, the dimension $p$ is (moderately) large, or the signal-to-noise ratio $1/\sigma$ is large (that is, when the label probabilities are close to 0 or 1). With this in mind, we study the logistic regression estimator in the regime $p\ll n/\log n$, assuming Gaussian covariates and labels generated by the Gaussian link function, with a mild constraint on the estimator's norm in the optimization to ensure existence. We provide finite-sample guarantees for its direction, which serves as a classifier, and for its Euclidean norm, which is an estimator of the signal-to-noise ratio. We distinguish between two regimes. In the low-noise/small-sample regime ($n\sigma\lesssim p\log n$), we show that the estimator's direction (and consequently the classification error) achieves the rate $(p\log n)/n$, as if the problem were noiseless. In this case, the norm of the estimator is at least of order $n/(p\log n)$. If instead $n\sigma\gtrsim p\log n$, the estimator's direction achieves the rate $\sqrt{\sigma p\log n/n}$, whereas its norm converges to the true norm at the rate $\sqrt{p\log n/(n\sigma^3)}$. As a corollary, the data are not linearly separable with high probability in this regime. The logistic regression estimator allows one to determine, with high probability, which regime occurs. Therefore, inference for logistic regression is possible in the regime $n\sigma\gtrsim p\log n$. In either case, logistic regression provides a competitive classifier.
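The two regimes can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the procedure analyzed in the paper: the function names, the norm bound `max_norm`, the step size, the fixed iteration count, and the choices $n=2000$, $p=5$, $\sigma\in\{1, 0.01\}$ are all hypothetical. Labels are drawn as $y=\mathrm{sign}(\langle x,\beta\rangle+\sigma\varepsilon)$ with Gaussian $\varepsilon$, which corresponds to the Gaussian link, and the norm-constrained estimator is approximated by projected gradient descent. With these values, $\sigma=1$ gives $n\sigma\gg p\log n$, while $\sigma=0.01$ gives $n\sigma<p\log n$.

```python
import numpy as np


def simulate(n, p, sigma, rng):
    """Gaussian covariates; labels y = sign(<x, beta> + sigma * noise)."""
    beta = np.zeros(p)
    beta[0] = 1.0  # unit-norm true direction (an arbitrary illustrative choice)
    X = rng.standard_normal((n, p))
    z = X @ beta + sigma * rng.standard_normal(n)
    y = np.where(z >= 0, 1.0, -1.0)
    return X, y, beta


def fit_logistic(X, y, max_norm=50.0, steps=2000, lr=1.0):
    """Projected gradient descent on the mean logistic loss over ||w|| <= max_norm.

    The iteration count is fixed, so in nearly separable data the result is
    only an approximation of the constrained minimizer.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # 1 / (1 + exp(margins)), written with tanh for numerical stability
        weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(X * (y * weights)[:, None]).mean(axis=0)
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > max_norm:  # project back onto the Euclidean ball
            w = w * (max_norm / norm)
    return w


def direction_error(w, beta):
    """Euclidean distance between the normalized estimate and the true direction."""
    return float(np.linalg.norm(w / np.linalg.norm(w) - beta))


def run_demo(seed=0, n=2000, p=5):
    """Fit one data set per noise level; return (direction error, norm) for each."""
    rng = np.random.default_rng(seed)
    results = {}
    for name, sigma in [("high_noise", 1.0), ("low_noise", 0.01)]:
        X, y, beta = simulate(n, p, sigma, rng)
        w = fit_logistic(X, y)
        results[name] = (direction_error(w, beta), float(np.linalg.norm(w)))
    return results


if __name__ == "__main__":
    for name, (err, norm) in run_demo().items():
        print(f"{name}: direction error = {err:.3f}, norm = {norm:.2f}")
```

In this sketch the fitted norm should come out markedly larger for $\sigma=0.01$ than for $\sigma=1$, in line with the coefficient inflation described above. The fitted norm is only expected to track $1/\sigma$ up to a constant that the sketch does not calibrate, and the low-noise fit may still be growing when the iterations stop.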