Quantifying predictive uncertainty of aphasia severity in stroke patients with sparse heteroscedastic Bayesian high-dimensional regression

Sparse linear regression methods for high-dimensional data commonly assume that residuals have constant variance, an assumption that is often violated in practice. For example, Aphasia Quotient (AQ) is a critical measure of language impairment and informs treatment decisions, but it is challenging to measure in stroke patients. It is therefore of interest to predict AQ from high-resolution T2 neuroimages of brain damage. However, sparse regression models for this task show marked evidence of heteroscedastic error even after transformations are applied. This violation of the homoscedasticity assumption can lead to bias in estimated coefficients, prediction intervals (PIs) of improper length, and inflated type I error rates. Bayesian heteroscedastic linear regression models relax the homoscedastic error assumption but can enforce restrictive prior assumptions on parameters, and many are computationally infeasible in the high-dimensional setting. This paper proposes estimating high-dimensional heteroscedastic linear regression models using a heteroscedastic partitioned empirical Bayes Expectation Conditional Maximization (H-PROBE) algorithm. H-PROBE is a computationally efficient maximum a posteriori estimation approach that requires minimal prior assumptions and can incorporate covariates hypothesized to impact heterogeneity. We apply this method to high-dimensional neuroimages to predict AQ and to provide PIs that accurately quantify predictive uncertainty. Our analysis demonstrates that H-PROBE can provide narrower PIs than standard methods without sacrificing coverage. Narrower PIs are clinically important for determining the risk of moderate to severe aphasia. Additionally, through extensive simulation studies, we show that H-PROBE yields superior prediction, variable selection, and predictive inference compared to alternative methods.

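To make concrete the abstract's point that prediction intervals have improper length under heteroscedastic error, the following is a minimal low-dimensional sketch. It is not H-PROBE and not the authors' analysis: the simulated data, the single covariate `z`, the log-linear variance model, and the function name `simulate_pi_coverage` are all assumptions chosen for illustration. It fits an ordinary least-squares mean, then compares 95% PIs built from one constant residual standard deviation against PIs whose standard deviation is estimated by regressing log squared residuals on `z`.

```python
import numpy as np

# -E[log(chi-square with 1 d.o.f.)]; added back to debias the log-variance fit.
LOG_CHI2_BIAS = 1.2704


def simulate_pi_coverage(n_train=5000, n_test=20000, seed=0):
    """Compare constant-variance and variance-model PIs on toy data (illustrative only)."""
    rng = np.random.default_rng(seed)

    def draw(n):
        # Hypothetical data-generating process: noise s.d. grows with z.
        z = rng.uniform(0.0, 1.0, n)
        sd = np.exp(-1.0 + 2.0 * z)
        y = 1.0 + 2.0 * z + sd * rng.standard_normal(n)
        return z, y

    z_tr, y_tr = draw(n_train)
    z_te, y_te = draw(n_test)
    A_tr = np.column_stack([np.ones(n_train), z_tr])
    A_te = np.column_stack([np.ones(n_test), z_te])

    # Mean model: ordinary least squares.
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    resid = y_tr - A_tr @ beta
    mean_te = A_te @ beta

    # Homoscedastic PI: one residual standard deviation for every test point.
    sd_const = np.full(n_test, resid.std(ddof=2))

    # Heteroscedastic PI: log-linear variance model fitted to log squared residuals.
    gamma, *_ = np.linalg.lstsq(A_tr, np.log(resid**2), rcond=None)
    sd_hetero = np.exp(0.5 * (A_te @ gamma + LOG_CHI2_BIAS))

    out = {}
    for name, sd in (("homoscedastic", sd_const), ("heteroscedastic", sd_hetero)):
        covered = np.abs(y_te - mean_te) <= 1.96 * sd
        for label, mask in (("low_z", z_te < 0.25), ("high_z", z_te > 0.75)):
            # Store (empirical coverage, mean PI width) for each subgroup.
            out[name, label] = (covered[mask].mean(), (2 * 1.96 * sd[mask]).mean())
    return out


if __name__ == "__main__":
    for key, (coverage, width) in simulate_pi_coverage().items():
        print(key, f"coverage={coverage:.3f}", f"mean width={width:.2f}")
```

In this toy setting the constant-variance intervals are wider than necessary where the noise is small and too narrow where it is large, while the variance-model intervals adapt their width to `z`. The high-dimensional, sparse case treated in the paper requires the dedicated estimation approach it proposes; this sketch only illustrates the motivation.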