Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We study the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We address two questions: first, the existence of the MLE (it exists when the dataset is not linearly separated), and second, its accuracy when it exists. Both properties depend on the dimension of the covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and the excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second, to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
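As an illustration of the existence question, the following minimal sketch tests whether a dataset is strictly linearly separated by a hyperplane through the origin, in which case the logistic MLE does not exist. It is not the paper's method: the function names `is_linearly_separated` and `simulate_logistic`, the labels in {-1, +1}, the absence of an intercept, and the choice of a signal aligned with the first coordinate are all assumptions made for this example. The check is a linear feasibility problem solved with `scipy.optimize.linprog`; it detects strict separation only, which is sufficient but not necessary for the MLE to fail to exist.

```python
import numpy as np
from scipy.optimize import linprog


def is_linearly_separated(X, y):
    """Return True if some w satisfies y_i * <w, x_i> >= 1 for every i.

    X has shape (n, d); y has entries in {-1, +1}. Hypothetical helper.
    """
    n, d = X.shape
    # Rewrite y_i * <w, x_i> >= 1 as -(y_i * x_i)^T w <= -1.
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    # Zero objective: only feasibility matters. Variables are unbounded.
    res = linprog(
        c=np.zeros(d),
        A_ub=A_ub,
        b_ub=b_ub,
        bounds=[(None, None)] * d,
        method="highs",
    )
    # status == 0 means a feasible (optimal) point was found.
    return res.status == 0


def simulate_logistic(n, d, signal, rng):
    """Draw Gaussian covariates and labels from a well-specified logistic model.

    The parameter is signal * e_1 (an assumption made for illustration).
    """
    X = rng.standard_normal((n, d))
    theta = np.zeros(d)
    theta[0] = signal
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n, d in [(5, 10), (50, 10), (500, 10)]:
        X, y = simulate_logistic(n, d, signal=2.0, rng=rng)
        print(n, d, is_linearly_separated(X, y))
```

Repeating the simulation over many draws, for a grid of sample sizes, dimensions, and signal strengths, would give an empirical estimate of how often the data are separated in each regime.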