The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification results automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference that combines the scalability of modern AI algorithms with the guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC with novel contributions that address the treatment of correlated data (i.e., intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results we establish our algorithm's statistical inference properties, and apply the method to a large electronic health records database.