The logistic regression estimator is known to inflate the magnitude of its coefficients if the sample size $n$ is small, the dimension $p$ is (moderately) large, or the signal-to-noise ratio $1/\sigma$ is large (that is, the label probabilities are close to 0 or 1). With this in mind, we study the logistic regression estimator with $p\ll n/\log n$, assuming Gaussian covariates and labels generated by the Gaussian link function, with a mild optimization constraint on the estimator's length to ensure existence. We provide finite-sample guarantees for its direction, which serves as a classifier, and for its Euclidean norm, which is an estimator of the signal-to-noise ratio. We distinguish between two regimes. In the low-noise/small-sample regime ($\sigma\lesssim (p\log n)/n$), we show that the estimator's direction (and consequently the classification error) achieves the rate $(p\log n)/n$, which, up to the log term, matches the rate of the noiseless problem. In this case, the norm of the estimator is at least of order $n/(p\log n)$. If instead $(p\log n)/n\lesssim \sigma\lesssim 1$, the estimator's direction achieves the rate $\sqrt{\sigma p\log n/n}$, whereas its norm converges to the true norm at the rate $\sqrt{p\log n/(n\sigma^3)}$. As a corollary, the data are not linearly separable with high probability in this regime. In either regime, logistic regression provides a competitive classifier.
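The setting can be explored numerically with a small simulation. The sketch below is illustrative only and is not taken from the paper. It assumes that "Gaussian link" means labels $y_i=\operatorname{sign}(\langle x_i,\beta^*\rangle+\sigma\varepsilon_i)$ with standard Gaussian $\varepsilon_i$ and a unit-norm $\beta^*$. It also assumes that the length constraint can be imposed as a Euclidean ball of radius `max_norm`, handled here by projected gradient descent. The function names, the step size, the number of steps, and the radius are all hypothetical choices.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def simulate(n, p, sigma, rng):
    """Draw x_i ~ N(0, I_p) and y_i = sign(<x_i, beta*> + sigma * eps_i) (assumed model)."""
    beta_star = [1.0] + [0.0] * (p - 1)  # unit-norm true direction, an arbitrary choice
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        z = sum(a * b for a, b in zip(x, beta_star)) + sigma * rng.gauss(0.0, 1.0)
        X.append(x)
        y.append(1.0 if z >= 0 else -1.0)
    return X, y, beta_star


def fit_constrained_logistic(X, y, max_norm, steps=500, lr=2.0):
    """Projected gradient descent on the mean logistic loss over the ball ||beta|| <= max_norm."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for x, yi in zip(X, y):
            margin = yi * sum(a * b for a, b in zip(x, beta))
            # Gradient of log(1 + exp(-margin)) w.r.t. beta is -y * sigmoid(-margin) * x.
            weight = -yi * sigmoid(-margin) / n
            for j in range(p):
                grad[j] += weight * x[j]
        beta = [b - lr * g for b, g in zip(beta, grad)]
        length = norm(beta)
        if length > max_norm:  # project back onto the ball (the length constraint)
            beta = [b * max_norm / length for b in beta]
    return beta


def direction_error(beta, beta_star):
    """Euclidean distance between the normalized estimate and the unit-norm beta*."""
    length = norm(beta)
    if length == 0.0:
        return norm(beta_star)
    return norm([b / length - s for b, s in zip(beta, beta_star)])


if __name__ == "__main__":
    rng = random.Random(0)
    for sigma in (1.0, 0.01):
        X, y, beta_star = simulate(n=300, p=3, sigma=sigma, rng=rng)
        beta_hat = fit_constrained_logistic(X, y, max_norm=50.0)
        print(f"sigma={sigma}: norm={norm(beta_hat):.2f}, "
              f"direction error={direction_error(beta_hat, beta_star):.3f}")
```

Running the script prints the estimator's norm and direction error at one larger and one smaller noise level, which gives a rough feel for how the norm grows as $\sigma$ shrinks. A fixed number of gradient steps only approximates the constrained minimizer, so the numbers should be read qualitatively rather than compared against the stated rates.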