As the availability, size, and complexity of data have increased in recent years, machine learning (ML) techniques have become popular for modeling. Predictions from ML models are often used for inference, decision-making, and downstream applications. A crucial yet often overlooked aspect of ML is uncertainty quantification, which can significantly affect how model predictions are used and interpreted. Extreme Gradient Boosting (XGBoost) is one of the most popular ML methods given its simple implementation, fast computation, and sequential learning, which make its predictions highly accurate compared to other methods. However, no technique for quantifying uncertainty in ML models such as XGBoost has yet been universally agreed upon across their varied applications. We propose an enhancement to XGBoost, QXGBoost, in which a modified quantile regression is used as the objective function to estimate uncertainty. Specifically, we include the Huber norm in the quantile regression model to construct a differentiable approximation to the quantile regression error function. This key step allows XGBoost, which uses a gradient-based optimization algorithm, to make probabilistic predictions efficiently. QXGBoost was applied to create 90% prediction intervals for one simulated dataset and one real-world environmental dataset of measured traffic noise. Our proposed method performed comparably to or better than the uncertainty estimates generated by regular and quantile light gradient boosting. For both the simulated and traffic noise datasets, the overall performance of the prediction intervals from QXGBoost was better than that of the other models based on the coverage width-based criterion.
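To make the idea of a Huber-smoothed quantile objective concrete, the following is a minimal sketch rather than the exact QXGBoost formulation, which the abstract does not spell out. It assumes one common construction: the absolute residual in the quantile (pinball) loss is replaced by a Huber function with threshold `delta`, and the result is weighted by `tau` or `1 - tau` depending on the sign of the residual. The function names, the default `delta`, and the small floor `min_hess` on the second derivative are all illustrative assumptions.

```python
import numpy as np


def huber_quantile_loss(y_true, y_pred, tau, delta=1.0):
    """Asymmetrically weighted Huber loss (assumed smoothing of the pinball loss)."""
    r = y_true - y_pred
    abs_r = np.abs(r)
    # Huber function: quadratic near zero, linear in the tails.
    huber = np.where(abs_r <= delta, r ** 2 / (2.0 * delta), abs_r - delta / 2.0)
    # Quantile weighting: tau for under-prediction, 1 - tau for over-prediction.
    weight = np.where(r >= 0, tau, 1.0 - tau)
    return weight * huber


def huber_quantile_grad_hess(y_true, y_pred, tau, delta=1.0, min_hess=1e-6):
    """First and second derivatives of the loss with respect to y_pred."""
    r = y_true - y_pred
    weight = np.where(r >= 0, tau, 1.0 - tau)
    inside = np.abs(r) <= delta
    # d(huber)/dr is r / delta inside the threshold and sign(r) outside it;
    # the chain rule with r = y_true - y_pred flips the sign.
    grad = -weight * np.where(inside, r / delta, np.sign(r))
    # The second derivative vanishes in the linear tails, so it is floored
    # at min_hess (an illustrative choice) to keep it strictly positive.
    hess = np.maximum(np.where(inside, weight / delta, 0.0), min_hess)
    return grad, hess


def make_objective(tau, delta=1.0):
    """Wrap the derivatives in the (preds, dtrain) -> (grad, hess) callback form."""
    def objective(preds, dtrain):
        return huber_quantile_grad_hess(dtrain.get_label(), preds, tau, delta)
    return objective
```

Under these assumptions, a callback built with `make_objective` could be passed as the `obj` argument of `xgboost.train`. A 90% prediction interval could then be formed from two separately trained models, for example one with `tau=0.05` and one with `tau=0.95`; that choice of quantiles is an assumption here, not a detail stated in the abstract.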