The generalized linear mixed model (GLMM) is widely used for analyzing correlated data, particularly in large-scale biomedical and social science applications. Scalable Bayesian inference for GLMMs is challenging because the marginal likelihood is intractable and conventional Markov chain Monte Carlo (MCMC) methods become computationally prohibitive as the number of subjects grows. We develop a stochastic gradient MCMC (SGMCMC) algorithm tailored to GLMMs that enables accurate posterior inference in the large-sample regime. Our approach uses Fisher's identity to construct an unbiased Monte Carlo estimator of the gradient of the marginal log-likelihood, making SGMCMC feasible when direct gradient computation is impossible. We analyze the additional variability introduced by both minibatching and gradient approximation, and derive a post-hoc covariance correction that yields properly calibrated posterior uncertainty. Through simulations, we show that the proposed method provides accurate posterior means and variances, outperforming existing approaches, including control variate methods, in large-$n$ settings. We further demonstrate the method's practical utility in an analysis of electronic health records data, where accounting for variance inflation materially changes scientific conclusions.
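To make the role of Fisher's identity concrete: for random effects $b$ and parameters $\theta$, the identity states $\nabla_\theta \log p(y \mid \theta) = E_{b \mid y, \theta}[\nabla_\theta \log p(y, b \mid \theta)]$, so averaging the complete-data score over draws of $b$ from its conditional distribution gives an unbiased estimate of the marginal gradient. The sketch below is an illustration only, not the algorithm of this paper. It assumes a single cluster from a Gaussian random-intercept model, $y_j = \mu + b + \varepsilon_j$ with $b \sim N(0, \tau^2)$ and $\varepsilon_j \sim N(0, \sigma^2)$, chosen because both the marginal gradient and the conditional distribution of $b$ are available in closed form and the estimator can be checked. In a non-Gaussian GLMM the conditional draws would have to come from an approximate sampler. All function names and numerical values are hypothetical.

```python
import numpy as np


def exact_marginal_grad_mu(y, mu, sigma2, tau2):
    """Closed-form d/dmu of log p(y | mu) for one cluster.

    Marginally y ~ N(mu * 1, sigma2 * I + tau2 * J), which gives
    sum_j (y_j - mu) / (sigma2 + n * tau2).
    """
    n = y.size
    return np.sum(y - mu) / (sigma2 + n * tau2)


def fisher_mc_grad_mu(y, mu, sigma2, tau2, n_draws, rng):
    """Monte Carlo estimate of the same gradient via Fisher's identity."""
    n = y.size
    # Conditional distribution of the random intercept b given y (Gaussian here).
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    post_mean = post_var * np.sum(y - mu) / sigma2
    b = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    # Complete-data score w.r.t. mu for each draw: sum_j (y_j - mu - b) / sigma2.
    scores = (np.sum(y - mu) - n * b) / sigma2
    return scores.mean()


# Hypothetical toy data: one cluster with 8 observations.
rng = np.random.default_rng(0)
sigma2, tau2, mu_true = 1.0, 2.25, 1.0
y = mu_true + rng.normal(0.0, np.sqrt(tau2)) + rng.normal(0.0, np.sqrt(sigma2), size=8)

mu = 0.5  # point at which the gradient is evaluated
exact = exact_marginal_grad_mu(y, mu, sigma2, tau2)
approx = fisher_mc_grad_mu(y, mu, sigma2, tau2, n_draws=200_000, rng=rng)
print(f"exact gradient: {exact:.4f}, Fisher-identity estimate: {approx:.4f}")
```

The two numbers agree up to Monte Carlo error, which shrinks as `n_draws` grows. With fewer draws per cluster the estimate stays unbiased but becomes noisier, which is the kind of extra gradient variability, on top of minibatching, that the covariance correction described above is meant to account for.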