Divide and Recombine (D&R) is a statistical approach for analyzing large and complex datasets. It partitions the dataset into manageable subsets, applies the analytic method to each subset independently, and then combines the subset results to obtain results for the entire dataset. D&R strategies can be used to fit generalized linear models (GLMs) to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a new division method, named sequential partitioning, for D&R-based estimators on GLMs. We show that the resulting D&R estimator with the proposed standard error attains the same efficiency as the full-data estimate. We illustrate this on a large synthetic dataset and verify that the D&R results are accurate and identical to those from other available R packages.
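As a rough illustration of the divide, fit, and recombine steps described in the abstract, the following Python sketch fits a logistic GLM on each subset and combines the subset estimates by information-weighted averaging, taking the inverse of the summed Fisher information as the combined covariance. This combination rule is a common choice that we assume here for illustration; it is not necessarily the estimator or the standard error proposed in the paper. The function names `fit_logistic` and `divide_and_recombine` are hypothetical, the data are simulated, and splitting into contiguous row blocks is simply one convenient division that should not be read as the paper's sequential partitioning.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic GLM by Newton-Raphson; return (beta, Fisher information)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
        w = mu * (1.0 - mu)                       # GLM working weights
        info = X.T @ (X * w[:, None])             # Fisher information X'WX
        step = np.linalg.solve(info, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Re-evaluate the information at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    w = mu * (1.0 - mu)
    return beta, X.T @ (X * w[:, None])


def divide_and_recombine(X, y, n_subsets):
    """Assumed combination rule: information-weighted average of subset fits."""
    p = X.shape[1]
    info_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    # Divide: contiguous row blocks (an illustrative choice only).
    blocks = zip(np.array_split(X, n_subsets), np.array_split(y, n_subsets))
    for X_k, y_k in blocks:
        beta_k, info_k = fit_logistic(X_k, y_k)   # fit each subset independently
        info_sum += info_k
        weighted_sum += info_k @ beta_k
    # Recombine: beta = (sum I_k)^{-1} sum I_k beta_k, cov = (sum I_k)^{-1}.
    cov = np.linalg.inv(info_sum)
    return cov @ weighted_sum, np.sqrt(np.diag(cov))


# Simulated data with hypothetical coefficients, for demonstration only.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_dr, se_dr = divide_and_recombine(X, y, n_subsets=10)
beta_full, info_full = fit_logistic(X, y)
se_full = np.sqrt(np.diag(np.linalg.inv(info_full)))

print("D&R estimate:      ", beta_dr, "SE:", se_dr)
print("Full-data estimate:", beta_full, "SE:", se_full)
```

Under these assumptions, only the per-subset estimate and information matrix need to be kept in memory at the recombination step, which is what makes the approach attractive when the full dataset cannot be loaded at once. Comparing the two printed lines gives an informal check of how closely the combined estimate and standard errors track the full-data fit on this simulated example.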