Application of generalized linear models in big data: a divide and recombine (D&R) approach

Divide and recombine (D&R) is a statistical approach designed to handle large and complex datasets. It partitions the dataset into several manageable subsets, applies the analytic method to each subset independently, and then combines the subset results to yield results for the entire dataset. D&R strategies can be used to fit generalized linear models (GLMs) to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a different division method, named sequential partitioning, for D&R-based estimators on GLMs. We show that the resulting D&R estimator with the proposed standard error attains the same efficiency as the full-data estimator. We illustrate this on a large synthetic dataset and verify that the results from D&R are accurate and identical to those from other available R packages.
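The divide, fit, and recombine steps described above can be sketched for a logistic-regression GLM. This is a minimal illustration and not the paper's method: it assumes contiguous row blocks as subsets (not necessarily the proposed sequential partitioning) and uses a common information-weighted combination rule, with the combined standard error taken from the inverse of the summed Fisher information. The function names `fit_logistic_irls` and `divide_and_recombine`, the synthetic data, and the coefficient values are all hypothetical.

```python
import numpy as np


def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic regression by Newton/IRLS.

    Returns the coefficient estimate and the Fisher information at it.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)                       # GLM working weights
        info = X.T @ (X * w[:, None])             # Fisher information
        step = np.linalg.solve(info, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    return beta, X.T @ (X * w[:, None])


def divide_and_recombine(X, y, n_subsets):
    """Assumed combination rule: information-weighted average of subset fits."""
    p = X.shape[1]
    total_info = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    # Divide: contiguous row blocks (an assumption of this sketch).
    for idx in np.array_split(np.arange(len(y)), n_subsets):
        beta_k, info_k = fit_logistic_irls(X[idx], y[idx])
        total_info += info_k
        weighted_sum += info_k @ beta_k
    # Recombine: pooled estimate and its standard errors.
    cov = np.linalg.inv(total_info)
    return cov @ weighted_sum, np.sqrt(np.diag(cov))


# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_dr, se_dr = divide_and_recombine(X, y, n_subsets=10)
beta_full, info_full = fit_logistic_irls(X, y)
se_full = np.sqrt(np.diag(np.linalg.inv(info_full)))

print("D&R estimate:  ", beta_dr, "SE:", se_dr)
print("Full-data fit: ", beta_full, "SE:", se_full)
```

Under these assumptions, each subset only needs to return its coefficient vector and information matrix, so the full dataset never has to be held in memory at once. Comparing the two printed lines gives a rough sense of how close the combined estimate and standard errors are to the full-data fit on this toy example.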

