Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that deviate substantially from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased (underestimated), while those for small-valued outcomes tend to be positively biased (overestimated). We refer to this linear warping of predictions toward the central tendency as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this issue persists across various machine learning models, and then examine its theoretical underpinnings. We propose a general constrained optimization approach designed to correct this bias and develop a computationally efficient algorithm to implement it. Our simulation results indicate that the correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. Compared with competing machine learning models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.
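The bias pattern described above can be reproduced in a small simulation. The following is a minimal sketch, not the authors' experiment or their correction method: it assumes a synthetic linear data-generating process with Gaussian noise and an ordinary least-squares fit, and the function name `bias_slope` and all parameter values are hypothetical choices made for illustration. It regresses the prediction error (predicted minus observed) on the observed outcome; a negative slope indicates that large outcomes are underestimated and small outcomes are overestimated.

```python
import numpy as np


def bias_slope(n_train=2000, n_test=2000, n_features=5, noise_sd=1.0, seed=0):
    """Fit OLS on synthetic data and measure how test error varies with the outcome.

    Hypothetical illustration: the data-generating process is an assumption,
    not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=n_features)  # assumed true coefficients

    X_train = rng.normal(size=(n_train, n_features))
    X_test = rng.normal(size=(n_test, n_features))
    y_train = X_train @ beta + rng.normal(scale=noise_sd, size=n_train)
    y_test = X_test @ beta + rng.normal(scale=noise_sd, size=n_test)

    # Ordinary least squares with an intercept column.
    A_train = np.column_stack([np.ones(n_train), X_train])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    A_test = np.column_stack([np.ones(n_test), X_test])
    pred = A_test @ coef

    # Prediction error as a function of the observed outcome.
    error = pred - y_test
    slope = np.polyfit(y_test, error, 1)[0]  # linear trend of error vs. outcome
    return slope, y_test, error


slope, y_test, error = bias_slope()

# Mean error in the top and bottom deciles of the observed outcome.
high = y_test > np.quantile(y_test, 0.9)
low = y_test < np.quantile(y_test, 0.1)

print("slope of error vs. outcome:", slope)
print("mean error, top decile:", error[high].mean())
print("mean error, bottom decile:", error[low].mean())
```

In this sketch the negative slope arises because the model can only recover the predictable part of the outcome, so predictions have lower variance than the observed values and are pulled toward the mean. The same diagnostic (error regressed on the observed outcome) can be applied to any regression model's held-out predictions.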