Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We study the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We address two questions: first, the existence of the MLE (it exists when the dataset is not linearly separated), and second, its accuracy when it exists. These properties depend on both the dimension of the covariates and the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second, to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
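For readers who want the objects named above in symbols, the following is a sketch in standard notation. The notation is an assumption and may differ from the paper's: labels are coded as $Y \in \{-1, 1\}$, covariates are $X \in \mathbb{R}^d$, the model has no intercept, and $\theta^*$ denotes the true parameter in the well-specified case.

```latex
% Well-specified logistic model (assumed notation)
\mathbb{P}(Y = 1 \mid X) = \sigma(\langle \theta^*, X \rangle),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}} .

% Logistic loss and logistic risk
\ell(\theta; x, y) = \log\bigl(1 + e^{-y \langle \theta, x \rangle}\bigr),
\qquad L(\theta) = \mathbb{E}\bigl[\ell(\theta; X, Y)\bigr] .

% MLE from a sample (X_1, Y_1), \dots, (X_n, Y_n), and its excess logistic risk
\widehat{\theta}_n \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d}
  \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; X_i, Y_i),
\qquad
L(\widehat{\theta}_n) - \inf_{\theta \in \mathbb{R}^d} L(\theta) .

% The dataset is linearly separated if
\exists\, \theta \neq 0 \ \text{such that} \quad
Y_i \langle \theta, X_i \rangle \geq 0 \quad \text{for all } i = 1, \dots, n .
```

Under the additional assumption that $X_1, \dots, X_n$ span $\mathbb{R}^d$, the classical criterion is that the empirical minimizer exists, and is then unique, exactly when no such separating $\theta$ exists. Otherwise the empirical logistic loss can be decreased indefinitely along a separating direction and its infimum is not attained.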