Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression

Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, an assumption that is often violated in practice. For example, the Aphasia Quotient (AQ) is a critical measure of language impairment that informs treatment decisions, but it is challenging to measure in stroke patients. It is therefore of interest to use high-resolution T2 neuroimages of brain damage to predict AQ. However, sparse regression models fit to these data show marked evidence of heteroscedastic error even after transformations are applied. This violation of the homoscedasticity assumption can lead to bias in estimated coefficients, prediction intervals (PIs) of improper length, and inflated type I error rates. Bayesian heteroscedastic linear regression models relax the homoscedastic error assumption but can enforce restrictive prior assumptions on parameters, and many are computationally infeasible in the high-dimensional setting. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. H-PROBE is a computationally efficient maximum a posteriori estimation approach that requires minimal prior assumptions and can incorporate covariates hypothesized to impact heterogeneity. We apply this method to high-dimensional neuroimages to predict AQ and to provide PIs that accurately quantify predictive uncertainty. Our analysis demonstrates that H-PROBE can provide narrower PIs than standard methods without sacrificing coverage. Narrower PIs are clinically important for determining the risk of moderate to severe aphasia. Additionally, through extensive simulation studies, we show that H-PROBE yields superior prediction, variable selection, and predictive inference compared to alternative methods.
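
To illustrate why heteroscedastic error distorts PIs, the following minimal Python sketch compares constant-width intervals with intervals whose width depends on a variance covariate. It is not the H-PROBE algorithm: it is a low-dimensional, two-step least-squares illustration, and the log-linear variance form, the covariate `v`, the value of `omega`, and the sample sizes are all assumptions made for this example. The intervals also ignore parameter estimation uncertainty, which is negligible at this sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, omega=1.5):
    """Draw data whose error SD grows with a variance covariate v (assumed form)."""
    x = rng.normal(size=n)                # mean covariate
    v = rng.uniform(0.0, 2.0, size=n)     # covariate assumed to drive the variance
    sd = np.exp(0.5 * omega * v)          # log-linear variance: Var = exp(omega * v)
    y = 2.0 * x + sd * rng.normal(size=n)
    return x, v, y

x_tr, v_tr, y_tr = simulate(4000)
x_te, v_te, y_te = simulate(4000)

# Step 1: ordinary least squares for the mean model.
X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
resid = y_tr - X_tr @ beta

# Homoscedastic benchmark: a single pooled residual SD.
sigma_hom = resid.std(ddof=2)

# Step 2: regress log squared residuals on v to estimate the variance model.
# E[log chi^2_1] is about -1.2704, so the intercept is shifted to remove that bias.
V_tr = np.column_stack([np.ones_like(v_tr), v_tr])
gamma, *_ = np.linalg.lstsq(V_tr, np.log(resid**2), rcond=None)
gamma[0] += 1.2704

# 95% prediction intervals on held-out data.
pred = np.column_stack([np.ones_like(x_te), x_te]) @ beta
z = 1.96
half_hom = np.full_like(pred, z * sigma_hom)
half_het = z * np.exp(0.5 * (gamma[0] + gamma[1] * v_te))

def coverage(half_width, mask):
    """Fraction of held-out responses inside pred +/- half_width within a subgroup."""
    return float(np.mean(np.abs(y_te - pred)[mask] <= half_width[mask]))

low, high = v_te < 0.5, v_te > 1.5        # low- and high-variance subgroups
cov_hom_low, cov_hom_high = coverage(half_hom, low), coverage(half_hom, high)
cov_het_low, cov_het_high = coverage(half_het, low), coverage(half_het, high)
width_hom, width_het = 2 * half_hom.mean(), 2 * half_het.mean()

print(f"constant-width PI coverage: low-v {cov_hom_low:.3f}, high-v {cov_hom_high:.3f}")
print(f"variance-model PI coverage: low-v {cov_het_low:.3f}, high-v {cov_het_high:.3f}")
print(f"mean PI width: constant {width_hom:.2f}, variance-model {width_het:.2f}")
```

Under these assumptions, the constant-width intervals tend to over-cover observations with small error variance and under-cover those with large error variance, whereas modeling the variance keeps subgroup coverage close to the nominal level with a smaller average width.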
