Hierarchical data analysis is central to discovery in many fields. The linear mixed model is often used to model hierarchical data, but estimating its parameters is computationally expensive, especially with big data. Subsampling techniques have been developed to address this challenge. However, most existing subsampling methods assume homogeneous data and do not account for the heterogeneity that hierarchical data may exhibit. To address this limitation, we develop a new approach, group-orthogonal subsampling (GOSS), for selecting informative subsets of hierarchical data that may be heterogeneous. GOSS selects subdata with balanced sample sizes across groups and combinatorial orthogonality within each group, resulting in subdata that are $D$- and $A$-optimal for building linear mixed models. Parameter estimators computed from GOSS subdata are consistent and asymptotically normal. Simulations and a real data application show that GOSS performs well numerically. Theoretical proofs, R code, and supplementary numerical results are available online as Supplementary Materials.