Heteroscedastic sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm

Sparse linear regression methods for high-dimensional data often assume that residuals have constant variance. When this assumption is violated, estimated coefficients can be biased, prediction intervals (PIs) can have improper length, and type I error rates can be inflated. We propose a heteroscedastic high-dimensional linear regression model fit with a partitioned empirical Bayes Expectation-Conditional-Maximization (H-PROBE) algorithm. H-PROBE is a computationally efficient maximum a posteriori estimation approach based on a Parameter-Expanded Expectation-Conditional-Maximization algorithm. It requires minimal prior assumptions on the regression parameters by using plug-in empirical Bayes estimates of hyperparameters. The variance model places a multivariate log-Gamma prior on its coefficients and can incorporate covariates hypothesized to impact heterogeneity. Our approach is motivated by a study relating Aphasia Quotient (AQ) to high-resolution T2 neuroimages of brain damage in stroke patients. AQ is a vital measure of language impairment and informs treatment decisions, but it is challenging to measure and subject to heteroscedastic errors. It is therefore of clinical importance, and the goal of this paper, to use high-dimensional neuroimages to predict AQ and provide PIs that accurately reflect the heterogeneity in residual variance. Our analysis demonstrates that H-PROBE can use markers of heterogeneity to provide narrower PIs than standard methods without sacrificing coverage. Through extensive simulation studies, we show that H-PROBE yields better prediction, variable selection, and predictive inference than competing methods.
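To make the claim about improper PI length concrete, the following is a minimal toy sketch, not the H-PROBE algorithm. It assumes a low-dimensional model with one predictor, a hypothetical binary marker `v` that switches the residual standard deviation between 0.5 and 2.0, and a plain least-squares fit. It compares PIs built from a single pooled residual variance with PIs built from a variance estimated separately for each value of the marker. All names and numeric settings are illustrative assumptions.

```python
import numpy as np


def simulate(n, rng):
    """Draw y = 2*x + eps, where sd(eps) is 0.5 if v == 0 and 2.0 if v == 1."""
    x = rng.normal(size=n)
    v = rng.integers(0, 2, size=n)  # hypothetical binary marker of heterogeneity
    sd = np.where(v == 1, 2.0, 0.5)
    y = 2.0 * x + sd * rng.normal(size=n)
    return x, v, y


def coverage_by_group(seed=0, n_train=2000, n_test=20000, z=1.96):
    """Empirical coverage and width of nominal 95% PIs, split by the marker v."""
    rng = np.random.default_rng(seed)

    # Fit a no-intercept least-squares slope on the training data.
    x, v, y = simulate(n_train, rng)
    beta = np.sum(x * y) / np.sum(x * x)
    resid = y - beta * x

    # Constant-variance assumption: one residual sd for everyone.
    sd_pooled = resid.std()
    # Heteroscedastic alternative: one residual sd per marker value.
    sd_group = [resid[v == g].std() for g in (0, 1)]

    # Evaluate both kinds of interval on fresh data.
    # Uncertainty in beta is ignored, which is reasonable for large n_train.
    xt, vt, yt = simulate(n_test, rng)
    err = np.abs(yt - beta * xt)

    out = {}
    for g in (0, 1):
        mask = vt == g
        out[g] = {
            "pooled": float(np.mean(err[mask] <= z * sd_pooled)),
            "grouped": float(np.mean(err[mask] <= z * sd_group[g])),
            "width_pooled": float(2 * z * sd_pooled),
            "width_grouped": float(2 * z * sd_group[g]),
        }
    return out


if __name__ == "__main__":
    for g, stats in coverage_by_group().items():
        print(g, stats)
```

Under these assumptions, the pooled interval is too wide for the low-variance group and too narrow for the high-variance group, while the marker-specific intervals stay close to nominal coverage and are narrower where the noise is small. This is the low-dimensional analogue of using markers of heterogeneity to shorten PIs without losing coverage.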
