The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and it is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learning algorithms excel at scalable statistical estimation, but rarely include uncertainty quantification. In contrast, Bayesian inference provides full statistical inference, since uncertainty quantification follows automatically from the posterior distribution. Unfortunately, Bayesian inference algorithms, including Markov chain Monte Carlo (MCMC), become computationally intractable in big data settings. In this paper, we introduce a statistical inference algorithm at the intersection of AI and Bayesian inference that combines the scalability of modern AI algorithms with the guaranteed uncertainty quantification that accompanies Bayesian inference. Our algorithm is an extension of stochastic gradient MCMC, with novel contributions that address the treatment of correlated data (i.e., an intractable marginal likelihood) and proper posterior variance estimation. Through theoretical and empirical results, we establish our algorithm's statistical inference properties, and we apply the method to a large electronic health records database.
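For readers unfamiliar with stochastic gradient MCMC, the following is a minimal sketch of its simplest form, plain stochastic gradient Langevin dynamics (SGLD), and not the algorithm introduced in this paper. It assumes a toy model with no random effects: observations y_i ~ N(theta, 1) with a N(0, prior_sd^2) prior on theta, chosen because the exact posterior is available for comparison. The function name `sgld_gaussian_mean`, its parameters, and the default step size are hypothetical choices made for illustration.

```python
import numpy as np


def sgld_gaussian_mean(y, n_steps=5000, batch_size=100, step_size=None,
                       prior_sd=10.0, seed=0):
    """Plain SGLD for theta in y_i ~ N(theta, 1), prior theta ~ N(0, prior_sd^2).

    Illustrative baseline only; it does not handle correlated data or
    correct the posterior variance.
    """
    rng = np.random.default_rng(seed)
    n_total = y.shape[0]
    if step_size is None:
        # Assumed step size: small relative to the posterior variance (~1/N).
        step_size = 0.1 / n_total
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        # Draw a minibatch (with replacement) instead of using all N points.
        batch = y[rng.integers(0, n_total, size=batch_size)]
        # Gradient of the log prior.
        grad_prior = -theta / prior_sd**2
        # Minibatch gradient of the log likelihood, rescaled by N / n so it
        # is an unbiased estimate of the full-data gradient.
        grad_lik = (n_total / batch_size) * np.sum(batch - theta)
        # Langevin update: half-step along the gradient plus Gaussian noise
        # with variance equal to the step size.
        noise = rng.normal(0.0, np.sqrt(step_size))
        theta = theta + 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples[t] = theta
    return samples


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.normal(1.5, 1.0, size=10_000)
    draws = sgld_gaussian_mean(y)[1000:]  # discard burn-in
    # Exact conjugate posterior for comparison (prior_sd = 10).
    exact_mean = y.sum() / (len(y) + 1 / 10.0**2)
    exact_var = 1.0 / (len(y) + 1 / 10.0**2)
    print("SGLD mean:", draws.mean(), "exact:", exact_mean)
    print("SGLD var: ", draws.var(), "exact:", exact_var)
```

Each iteration touches only `batch_size` observations, which is the source of the scalability. With a constant step size, the minibatch gradient noise adds to the injected Langevin noise, so the sampled variance in this sketch tends to exceed the exact posterior variance. That illustrates why variance estimates from plain SGLD need care; it does not describe how the paper's algorithm addresses the issue.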