We study high-dimensional, ridge-regularized logistic regression in a setting in which the covariates may be missing or corrupted by additive noise. When both the covariates and the additive corruptions are independent and normally distributed, we provide exact characterizations of both the prediction error and the estimation error. Moreover, we show that these characterizations are universal: as long as the entries of the data matrix satisfy a set of independence and moment conditions, our guarantees continue to hold. Universality, in turn, enables the detailed study of several imputation-based strategies when the covariates are missing completely at random. We ground our study by comparing the performance of these strategies with the conjectured performance of the Bayes optimal procedure, a conjecture stemming from replica theory in statistical physics. Our analysis yields several insights, including (i) a distinction between single imputation and a simple variant of multiple imputation and (ii) the observation that adding a simple ridge regularization term to single-imputed logistic regression can yield an estimator whose prediction error is nearly indistinguishable from the Bayes optimal prediction error. We supplement our findings with extensive numerical experiments.
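To make the setting concrete, the following is a minimal sketch of single-imputed, ridge-regularized logistic regression on simulated data whose entries are missing completely at random. It is an illustration under stated assumptions, not the experimental setup of this work: the helper names (`simulate`, `impute_mean`, `fit_ridge_logistic`, `objective`), the choice of column-mean imputation, the plain gradient-descent solver, and the values of `n`, `d`, `miss_prob`, and `lam` are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate(n, d, miss_prob, rng):
    """Gaussian covariates, logistic labels, and a missing-completely-at-random mask."""
    theta = [rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
    X, y, mask = [], [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        p = sigmoid(sum(a * b for a, b in zip(x, theta)))
        y.append(1 if rng.random() < p else 0)
        X.append(x)
        # True means the entry is observed; each entry is dropped independently.
        mask.append([rng.random() >= miss_prob for _ in range(d)])
    return X, y, mask, theta


def impute_mean(X, mask):
    """Single imputation (assumed variant): fill missing entries with the observed column mean."""
    n, d = len(X), len(X[0])
    means = []
    for j in range(d):
        obs = [X[i][j] for i in range(n) if mask[i][j]]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    return [[X[i][j] if mask[i][j] else means[j] for j in range(d)] for i in range(n)]


def objective(w, X, y, lam):
    """Mean logistic loss plus the ridge penalty (lam / 2) * ||w||^2."""
    loss = 0.0
    for xi, yi in zip(X, y):
        z = sum(a * b for a, b in zip(xi, w))
        # log(1 + exp(z)) - y * z, evaluated without overflow.
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return loss / len(X) + 0.5 * lam * sum(wj * wj for wj in w)


def fit_ridge_logistic(X, y, lam, steps=200, lr=0.5):
    """Plain gradient descent on `objective`; a stand-in for any convex solver."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the ridge term
        for xi, yi in zip(X, y):
            r = (sigmoid(sum(a * b for a, b in zip(xi, w))) - yi) / n
            for j in range(d):
                grad[j] += r * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


rng = random.Random(0)
X, y, mask, theta = simulate(n=200, d=20, miss_prob=0.3, rng=rng)
X_imp = impute_mean(X, mask)
w_hat = fit_ridge_logistic(X_imp, y, lam=0.1)
print("regularized training objective:", objective(w_hat, X_imp, y, 0.1))
```

Sweeping `lam` and `miss_prob` in such a simulation, and evaluating on fresh fully generated samples, is one way to explore empirically how regularization interacts with single imputation; the exact error characterizations and the comparison with the Bayes optimal procedure are the subject of the analysis itself and are not reproduced by this sketch.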