Bounded-memory adjusted scores estimation in generalized linear models with large data sets

The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood (ML) estimates are not, which is a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters, and appear to deliver a substantial reduction in the persistent bias of the ML estimator in high-dimensional settings where the number of parameters grows asymptotically linearly with, but slower than, the number of observations. In this work, we develop two new variants of iteratively reweighted least squares (IWLS) for estimating generalized linear models, either with adjusted score equations for mean bias reduction or by maximizing the likelihood penalized by a positive power of the Jeffreys-prior penalty. Both variants eliminate the requirement of storing $O(n)$ quantities in memory, and can operate with data sets that exceed computer memory or even hard drive capacity. We achieve that through incremental QR decompositions, which enable IWLS iterations to have access only to data chunks of predetermined size. We assess the procedures through a real-data application with millions of observations, and in high-dimensional logistic regression, where a large-scale simulation experiment produces concrete evidence for the existence of a simple adjustment to the maximum Jeffreys'-penalized likelihood estimates that delivers high accuracy in terms of signal recovery, even in cases where estimates from ML and other recently proposed corrective methods do not exist.
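To illustrate the chunk-wise idea described in the abstract, the following is a minimal sketch of IWLS in which each iteration sees the data only one chunk at a time, folding every chunk into a running triangular factor by an incremental QR update. It is not the authors' procedure: it fits ordinary maximum likelihood logistic regression, and the adjusted-score and Jeffreys-penalized variants would need further terms that are not shown. The names `update_qr`, `chunked_iwls_logistic`, and `chunks`, as well as the simulated data, are hypothetical and chosen only for this example.

```python
import numpy as np


def update_qr(R, qtz, X_chunk, z_chunk, w_chunk):
    """Fold one weighted chunk into the running factor R and vector Q^T z."""
    sw = np.sqrt(w_chunk)
    A = sw[:, None] * X_chunk          # weighted design rows of this chunk
    b = sw * z_chunk                   # weighted working response
    if R is not None:
        # Stack the previous triangular factor on top of the new rows.
        A = np.vstack([R, A])
        b = np.concatenate([qtz, b])
    Q, R_new = np.linalg.qr(A, mode="reduced")
    return R_new, Q.T @ b


def chunked_iwls_logistic(chunks, p, n_iter=25, tol=1e-10):
    """ML logistic regression by IWLS, holding at most one chunk in memory.

    `chunks` is a callable returning a fresh iterator over (X, y) pairs,
    since every IWLS iteration needs one full pass over the data.
    """
    beta = np.zeros(p)
    for _ in range(n_iter):
        R, qtz = None, None
        for X, y in chunks():
            eta = X @ beta
            mu = 1.0 / (1.0 + np.exp(-eta))
            w = mu * (1.0 - mu)        # working weights
            z = eta + (y - mu) / w     # working response
            R, qtz = update_qr(R, qtz, X, z, w)
        # R is upper triangular; a general solve keeps the sketch NumPy-only.
        beta_new = np.linalg.solve(R, qtz)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


# Simulated data (an assumption for this demo; in practice each chunk
# would be read from disk or a database rather than sliced from an array).
rng = np.random.default_rng(0)
n, p, chunk_size = 2000, 3, 250
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -1.0, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)


def chunks():
    for start in range(0, n, chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]


beta_chunked = chunked_iwls_logistic(chunks, p)
beta_full = chunked_iwls_logistic(lambda: iter([(X, y)]), p)
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_chunked)))
```

In this sketch the only quantities carried between chunks are the $p \times p$ factor `R` and the length-$p$ vector `qtz`, so the memory footprint depends on the chunk size and $p$ but not on $n$. The chunked fit should agree with the single-chunk fit up to numerical error, because stacking the previous factor on top of the new rows and refactorizing leaves the least-squares solution unchanged.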
