Divide and recombine (D&R) is a statistical approach for analyzing large and complex datasets. It partitions the dataset into manageable subsets, applies the analytic method to each subset independently, and then combines the subset results to yield results for the entire dataset. D&R strategies can be used to fit generalized linear models (GLMs) to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a new division method, named sequential partitioning, for D&R-based estimators in GLMs. We show that the D&R estimator with the proposed standard error attains the same efficiency as the full-data estimator. We illustrate this on a large synthetic dataset and verify that the D&R results are accurate and identical to those obtained from other available R packages.
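The divide, fit, and recombine steps described above can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the method proposed in this paper: it fits a logistic GLM by Newton-Raphson on each subset and combines the subset estimates with a generic information-weighted (inverse-variance) rule, taking the combined standard errors from the inverse of the summed Fisher information. The function names `fit_logistic` and `divide_and_recombine`, the contiguous row blocks, the simulated data, and the coefficient values are all hypothetical choices made for the illustration.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic GLM by Newton-Raphson; return (beta, Fisher information)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
        w = mu * (1.0 - mu)                       # GLM working weights
        info = X.T @ (X * w[:, None])             # Fisher information X'WX
        step = np.linalg.solve(info, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Re-evaluate the information at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (mu * (1.0 - mu))[:, None])
    return beta, info


def divide_and_recombine(X, y, n_subsets):
    """Information-weighted D&R estimate (an assumed, generic combination rule)."""
    p = X.shape[1]
    total_info = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    # Divide: contiguous row blocks (an assumption about the division scheme).
    for idx in np.array_split(np.arange(X.shape[0]), n_subsets):
        beta_k, info_k = fit_logistic(X[idx], y[idx])   # fit each subset alone
        total_info += info_k
        weighted_sum += info_k @ beta_k
    # Recombine: beta = (sum_k I_k)^{-1} sum_k I_k beta_k
    beta = np.linalg.solve(total_info, weighted_sum)
    se = np.sqrt(np.diag(np.linalg.inv(total_info)))
    return beta, se


# Synthetic data with hypothetical coefficients, for illustration only.
rng = np.random.default_rng(0)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_full, info_full = fit_logistic(X, y)
se_full = np.sqrt(np.diag(np.linalg.inv(info_full)))
beta_dr, se_dr = divide_and_recombine(X, y, n_subsets=10)

print("full data:", beta_full, se_full)
print("D&R      :", beta_dr, se_dr)
```

Comparing the two printed lines gives a rough sense of how closely a D&R fit can track the full-data fit when each subset is reasonably large. Only the summed information matrix and the weighted sum of subset estimates need to be kept in memory, which is what makes this kind of scheme attractive when the full dataset cannot be loaded at once.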