Linear mixed models (LMMs), which typically assume normality for both the random effects and error terms, are a popular class of methods for analyzing longitudinal and clustered data. However, such models can be sensitive to outliers, which can lead to poor statistical results (e.g., biased inference on model parameters and inaccurate prediction of random effects) when the data are contaminated. We propose a new approach to robust estimation and inference for LMMs based on a hierarchical gamma divergence, which offers an automated, data-driven way to downweight the effects of outliers in both the errors and the random effects using normalized powered density weights. For estimation and inference, we develop a computationally scalable minorization-maximization algorithm for the resulting objective function, along with a clustered bootstrap method for uncertainty quantification and a Hyvärinen score criterion for selecting the tuning parameter that controls the degree of robustness. Under suitable regularity conditions, as the number of clusters tends to infinity and provided the genuine and contamination mixed effects distributions are sufficiently separated, we show that the resulting robust estimates can be asymptotically controlled even under heavy (covariate-dependent) contamination. Simulation studies demonstrate that the hierarchical gamma divergence approach consistently outperforms several currently available methods for robustifying LMMs across a wide range of outlier-generation scenarios at both the response and random effects levels. We illustrate the proposed method using data from a multi-center AIDS cohort study, where the robust LMM based on hierarchical gamma divergence produces noticeably different results from methods that do not adequately adjust for potential outlier contamination.
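
To give a feel for how normalized powered density weights downweight outliers, the following is a minimal sketch for a much simpler setting than the one treated here: a univariate normal location-scale model with no random effects. It is not the hierarchical algorithm of this paper. The function names, the value gamma = 0.5, the starting values, and the simulated data are all assumptions made for illustration; the fixed-point updates are the standard ones for gamma divergence under a normal model.

```python
import math
import random
import statistics


def normalized_power_weights(y, mu, sigma2, gamma):
    """Weights proportional to the normal density raised to the power gamma,
    normalized to sum to one (hypothetical helper for this sketch)."""
    log_w = [
        gamma * (-0.5 * math.log(2 * math.pi * sigma2) - 0.5 * (yi - mu) ** 2 / sigma2)
        for yi in y
    ]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def robust_normal_fit(y, gamma=0.5, n_iter=500, tol=1e-10):
    """Iteratively reweighted fit of a normal mean and variance.

    Univariate illustration only; the paper's method additionally handles
    random effects and clustered data.
    """
    # Robust starting values: median and squared (scaled) MAD.
    mu = statistics.median(y)
    mad = statistics.median([abs(yi - mu) for yi in y])
    sigma2 = (1.4826 * mad) ** 2

    for _ in range(n_iter):
        w = normalized_power_weights(y, mu, sigma2, gamma)
        mu_new = sum(wi * yi for wi, yi in zip(w, y))
        # The factor (1 + gamma) corrects the shrinkage induced by the weights.
        sigma2_new = (1 + gamma) * sum(wi * (yi - mu_new) ** 2 for wi, yi in zip(w, y))
        converged = abs(mu_new - mu) + abs(sigma2_new - sigma2) < tol
        mu, sigma2 = mu_new, sigma2_new
        if converged:
            break

    return mu, sigma2, normalized_power_weights(y, mu, sigma2, gamma)


# Simulated example: 180 genuine observations and 20 well-separated outliers.
random.seed(0)
clean = [random.gauss(0.0, 1.0) for _ in range(180)]
outliers = [random.gauss(10.0, 1.0) for _ in range(20)]
y = clean + outliers

naive_mean = sum(y) / len(y)
mu_hat, sigma2_hat, w = robust_normal_fit(y, gamma=0.5)
outlier_weight = sum(w[180:])  # total weight assigned to the contaminated points

print(f"naive mean = {naive_mean:.3f}, robust mean = {mu_hat:.3f}")
print(f"robust variance = {sigma2_hat:.3f}, outlier weight = {outlier_weight:.2e}")
```

In this toy setting the outliers have very low density under the fitted model, so their normalized weights are close to zero and they barely affect the estimates, whereas the unweighted sample mean is pulled toward the contamination. Larger values of gamma downweight more aggressively at some cost in efficiency when the data are clean, which is why a data-driven choice of the tuning parameter matters.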