Application of generalized linear models in big data: a divide and recombine (D&R) approach

Divide and recombine (D&R) is a statistical approach designed to handle large and complex datasets. It partitions the dataset into several manageable subsets, applies the analytic method to each subset independently, and then combines the subset results to yield results for the entire dataset. D&R strategies can be used to fit generalized linear models (GLMs) to datasets too large for conventional methods. Several D&R strategies are available for different GLMs, some of which are theoretically justified but lack practical validation. A significant limitation is the lack of theoretical and practical justification for estimating combined standard errors and confidence intervals. This paper reviews D&R strategies for GLMs and proposes a method to determine the combined standard error for D&R-based estimators. In addition to the traditional dataset division procedures, we propose a different division method, named sequential partitioning, for D&R-based estimators on GLMs. We show that the obtained D&R estimator with the proposed standard error attains the same efficiency as the full-data estimate. We illustrate this on a large synthetic dataset and verify that the results from D&R are accurate and identical to those from other available R packages.
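The divide, fit, and recombine steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the combination rule here is a generic information-weighted average of the subset estimates, with the combined standard error taken from the inverse of the summed Fisher information, and it is assumed only for illustration. The function names `fit_logistic_irls` and `divide_and_recombine`, the logistic model, the synthetic data, and the split into contiguous row blocks are likewise hypothetical choices; the block split should not be read as the paper's sequential partitioning, which the abstract does not define.

```python
import numpy as np


def fit_logistic_irls(X, y, n_iter=50, tol=1e-10):
    """Fit a logistic-regression GLM by iteratively reweighted least squares.

    Returns the coefficient estimate and the Fisher information X' W X
    evaluated at that estimate.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = mu * (1.0 - mu)                      # GLM working weights
        info = X.T @ (X * w[:, None])            # Fisher information
        step = np.linalg.solve(info, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information at the final estimate.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    info = X.T @ (X * w[:, None])
    return beta, info


def divide_and_recombine(X, y, n_subsets):
    """Illustrative D&R estimator (assumed combination rule, not the paper's).

    Each subset is fitted independently; estimates are combined with
    weights equal to the subset Fisher information matrices.
    """
    p = X.shape[1]
    info_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    # Divide: contiguous blocks of rows (an assumption for this sketch).
    for idx in np.array_split(np.arange(len(y)), n_subsets):
        beta_k, info_k = fit_logistic_irls(X[idx], y[idx])
        info_sum += info_k
        weighted_sum += info_k @ beta_k
    # Recombine: information-weighted average and its standard error.
    beta = np.linalg.solve(info_sum, weighted_sum)
    se = np.sqrt(np.diag(np.linalg.inv(info_sum)))
    return beta, se


# Synthetic data (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_dr, se_dr = divide_and_recombine(X, y, n_subsets=10)
beta_full, info_full = fit_logistic_irls(X, y)
se_full = np.sqrt(np.diag(np.linalg.inv(info_full)))

print("D&R estimate:      ", beta_dr, "SE:", se_dr)
print("Full-data estimate:", beta_full, "SE:", se_full)
```

Only one subset needs to be in memory at a time, and each subset contributes a p-by-p information matrix and a length-p vector, so the recombination step is cheap regardless of how many rows the full dataset has.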

