We study empirical Bayes (EB) predictive density estimation in linear mixed models (LMMs) with a large number of units, which induces a high-dimensional random-effects space. Focusing on Kullback-Leibler (KL) risk minimization, we develop a calibration framework to optimally tune predictive densities derived from a broad class of flexible priors. Our proposed method addresses two key challenges in predictive inference: (a) severe data scarcity leading to highly imbalanced designs, in which replicates are available for only a small subset of units; and (b) distributional shifts in future covariates. To estimate predictive KL risk in LMMs, we use a data-fission approach that leverages exchangeability in the covariate distribution. We establish convergence rates for our proposed risk estimators and show how their efficiency deteriorates as data scarcity increases. Our results imply the decision-theoretic optimality of the proposed EB predictive density estimator. The theoretical development relies on a novel probabilistic analysis of the interaction between data fission, sample reuse, and the predictive heat-equation representation of George et al. (2006), which expresses predictive KL risk through expected log-marginals. Extensive simulation studies demonstrate strong predictive performance and robustness of the proposed approach across diverse regimes with varying degrees of data scarcity and covariate shift.