Bounded-memory adjusted scores estimation in generalized linear models with large data sets

The widespread use of maximum Jeffreys'-prior penalized likelihood in binomial-response generalized linear models, and in logistic regression in particular, is supported by the results of Kosmidis and Firth (2021, Biometrika), who show that the resulting estimates are always finite-valued, even in cases where the maximum likelihood (ML) estimates are not. Infinite ML estimates are a practical issue regardless of the size of the data set. In logistic regression, the implied adjusted score equations are formally bias-reducing in asymptotic frameworks with a fixed number of parameters, and appear to deliver a substantial reduction in the persistent bias of the ML estimator in high-dimensional settings where the number of parameters grows asymptotically linearly with, but slower than, the number of observations. In this work, we develop and present two new variants of iteratively reweighted least squares (IWLS) for estimating generalized linear models, either with adjusted score equations for mean bias reduction or by maximizing the likelihood penalized by a positive power of the Jeffreys'-prior penalty. Both variants eliminate the requirement of storing $O(n)$ quantities in memory, and can operate with data sets that exceed computer memory or even hard drive capacity. We achieve this through incremental QR decompositions, which allow IWLS iterations to access only data chunks of predetermined size. We assess the procedures through a real-data application with millions of observations, and in high-dimensional logistic regression, where a large-scale simulation experiment produces concrete evidence for the existence of a simple adjustment to the maximum Jeffreys'-penalized likelihood estimates that delivers high accuracy in terms of signal recovery, even in cases where estimates from ML and other recently proposed corrective methods do not exist.
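
To make the chunked idea concrete, the following is a minimal sketch of how an incremental QR decomposition lets an IWLS iteration see only one data chunk at a time. It is an illustration under stated assumptions, not the procedures developed in this work: it fits plain maximum likelihood logistic regression and omits the adjusted-score and Jeffreys'-prior terms entirely. The function names `update_qr` and `chunked_logistic_iwls`, the chunk size, and the simulated data are all hypothetical. The demo keeps `X` and `y` in memory only for convenience; in practice the chunk generator would read from disk or a database.

```python
import numpy as np

def update_qr(R, qtz, X_chunk, z_chunk, w_chunk):
    """Fold one weighted chunk into the running p x p factor R and the vector Q'z."""
    sw = np.sqrt(w_chunk)
    # Stack the current triangular factor on top of the weighted chunk.
    A = np.vstack([R, sw[:, None] * X_chunk])
    b = np.concatenate([qtz, sw * z_chunk])
    Q, R_new = np.linalg.qr(A)  # reduced QR, so R_new stays p x p
    return R_new, Q.T @ b

def chunked_logistic_iwls(chunks, p, n_iter=25, tol=1e-10):
    """Plain ML logistic regression by IWLS, one pass over the chunks per iteration.

    `chunks` is a callable returning a fresh iterator of (X_chunk, y_chunk) pairs.
    Only O(p^2) quantities plus one chunk are held at any time.
    """
    beta = np.zeros(p)
    for _ in range(n_iter):
        R = np.zeros((p, p))
        qtz = np.zeros(p)
        for X_chunk, y_chunk in chunks():
            eta = X_chunk @ beta
            mu = 1.0 / (1.0 + np.exp(-eta))
            w = mu * (1.0 - mu)              # working weights
            z = eta + (y_chunk - mu) / w     # working response
            R, qtz = update_qr(R, qtz, X_chunk, z, w)
        # Weighted least squares solution from the accumulated factors.
        beta_new = np.linalg.solve(R, qtz)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Hypothetical demo data, simulated purely for illustration.
rng = np.random.default_rng(0)
n, p = 2000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def chunks(size=256):
    # Stand-in for a reader that streams fixed-size chunks from storage.
    for start in range(0, n, size):
        yield X[start:start + size], y[start:start + size]

beta_hat = chunked_logistic_iwls(chunks, p)
```

The design point is that each chunk is absorbed into a $p \times p$ triangular factor and a length-$p$ vector, so memory use depends on the chunk size and $p$ but not on $n$. The adjusted-score variants would additionally need per-observation quantities such as leverages, which this sketch does not compute.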
