Heteroscedastic sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm

Sparse linear regression methods for high-dimensional data often assume that residuals have constant variance. Violating this assumption can bias the estimated coefficients, produce prediction intervals of improper length, and inflate type I error rates. This paper proposes a heteroscedastic (H) high-dimensional linear regression model fitted with a partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. H-PROBE is a computationally efficient maximum a posteriori (MAP) estimation approach based on a Parameter-Expanded Expectation-Conditional-Maximization (PX-ECM) algorithm. It requires minimal prior assumptions on the regression parameters because the hyperparameters are replaced by plug-in empirical Bayes estimates. The variance model uses recent advances in multivariate log-Gamma distribution theory and can include covariates hypothesized to impact heterogeneity. Our approach is motivated by a study relating Aphasia Quotient (AQ) to high-resolution T2 neuroimages of brain damage in stroke patients. AQ is a vital measure of language impairment and informs treatment decisions, but it is challenging to measure and subject to heteroscedastic errors. It is therefore of clinical importance, and the goal of this paper, to use high-dimensional neuroimages to predict AQ and to provide prediction intervals that accurately reflect the heterogeneity in the residual variance. Our analysis demonstrates that H-PROBE can use markers of heterogeneity to provide prediction intervals that are narrower than those from standard methods without sacrificing coverage. Further, extensive simulation studies show that the proposed approach yields better prediction, variable selection, and predictive inference than competing methods.
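To make the role of the variance model concrete, the following is a minimal sketch and not the H-PROBE algorithm itself. It assumes a low-dimensional setting, ordinary least squares for the mean, and a simple two-step regression of log squared residuals on a single variance covariate in place of the paper's log-Gamma-based empirical Bayes procedure. All names (`simulate`, `v`, `gamma`, `sd_het`, `sd_hom`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Simulated data (an assumption): log sigma_i^2 = 2 * v_i."""
    x = rng.normal(size=(n, 3))            # covariates in the mean model
    v = rng.uniform(-1.0, 1.0, size=n)     # covariate driving the variance
    sigma = np.exp(v)
    y = x @ np.array([1.0, -0.5, 0.0]) + sigma * rng.normal(size=n)
    return x, v, y

x_tr, v_tr, y_tr = simulate(2000)
x_te, v_te, y_te = simulate(2000)

# Step 1: mean model by ordinary least squares.
X_tr = np.column_stack([np.ones(len(y_tr)), x_tr])
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
resid = y_tr - X_tr @ beta

# Step 2: log-linear variance model fitted to log squared residuals.
# E[log chi^2_1] is about -1.2704, so the intercept is shifted to remove that bias.
V_tr = np.column_stack([np.ones(len(v_tr)), v_tr])
gamma, *_ = np.linalg.lstsq(V_tr, np.log(resid**2), rcond=None)
gamma[0] += 1.2704

# Approximate 95% prediction intervals on held-out data
# (parameter uncertainty is ignored, which is reasonable only for large n).
X_te = np.column_stack([np.ones(len(y_te)), x_te])
pred = X_te @ beta
sd_het = np.exp(0.5 * (gamma[0] + gamma[1] * v_te))             # observation-specific
sd_hom = np.full_like(sd_het, resid.std(ddof=X_tr.shape[1]))    # constant variance

def coverage(sd, mask=None):
    """Share of held-out responses inside pred +/- 1.96 * sd."""
    hit = np.abs(y_te - pred) <= 1.96 * sd
    return hit.mean() if mask is None else hit[mask].mean()

high = v_te > 0.5  # observations with the largest residual variance
print("mean width  hetero:", 2 * 1.96 * sd_het.mean(), " homo:", 2 * 1.96 * sd_hom.mean())
print("coverage    hetero:", coverage(sd_het), " homo:", coverage(sd_hom))
print("coverage (high variance)  hetero:", coverage(sd_het, high), " homo:", coverage(sd_hom, high))
```

Under these assumptions, the heteroscedastic intervals are narrower on average than the constant-variance ones because their widths follow each observation's estimated standard deviation. The constant-variance intervals, by contrast, tend to under-cover the observations with the largest residual variance.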
