Assessing the quality of aleatoric uncertainty estimates from uncertainty quantification (UQ) deep learning methods is important in scientific contexts, where uncertainty is physically meaningful and must be characterized and interpreted precisely. We systematically compare the aleatoric uncertainty predicted by two UQ techniques, Deep Ensembles (DE) and Deep Evidential Regression (DER). We consider both zero-dimensional (0D) and two-dimensional (2D) data to explore how the UQ methods behave across data dimensionalities. We investigate uncertainty injected on the input and on the output variables, and for input uncertainty we include a method to propagate the injected uncertainty so that the predicted aleatoric uncertainty can be compared to the known values. We experiment with three levels of noise. Across all models and experiments, the predicted aleatoric uncertainty scales with the injected noise level. However, the predicted uncertainty is miscalibrated with respect to the true uncertainty, at the level of $\mathrm{std}(\sigma_{\mathrm{al}})$, for half of the DE experiments and almost all of the DER experiments. For both UQ methods, the predicted uncertainty is least accurate for the 2D input uncertainty experiment and at the high noise level. While these results do not apply to more complex data, they highlight that further research on post-facto calibration for these methods would be beneficial, particularly for high-noise and high-dimensional settings.
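The propagation of input uncertainty mentioned above can be illustrated with a minimal sketch. The abstract does not specify the underlying model, so the linear relation $y = m x + b$, the function names, and the numerical values below are all assumptions chosen for illustration, not the authors' implementation. Under this assumption, Gaussian noise of width $\sigma_x$ on the input maps to an output uncertainty $\sigma_y = |m|\,\sigma_x$, which a Monte Carlo draw can confirm.

```python
import random
import statistics


def propagate_input_noise(slope: float, sigma_x: float) -> float:
    """First-order propagation for the assumed model y = slope * x + intercept.

    Returns sigma_y = |dy/dx| * sigma_x, which is exact for a linear model.
    """
    return abs(slope) * sigma_x


def empirical_output_std(
    slope: float,
    intercept: float,
    x_true: float,
    sigma_x: float,
    n_samples: int = 20_000,
    seed: int = 0,
) -> float:
    """Monte Carlo check: perturb the input with Gaussian noise, measure the output spread."""
    rng = random.Random(seed)
    ys = [
        slope * (x_true + rng.gauss(0.0, sigma_x)) + intercept
        for _ in range(n_samples)
    ]
    return statistics.pstdev(ys)


if __name__ == "__main__":
    # Hypothetical values: three illustrative noise levels, not those used in the paper.
    slope, intercept, x_true = 2.0, 1.0, 3.0
    for sigma_x in (0.1, 0.5, 1.0):
        analytic = propagate_input_noise(slope, sigma_x)
        sampled = empirical_output_std(slope, intercept, x_true, sigma_x)
        print(f"sigma_x={sigma_x}: analytic sigma_y={analytic:.3f}, sampled sigma_y={sampled:.3f}")
```

In this setting the analytic value plays the role of the "known" aleatoric uncertainty against which a model's predicted $\sigma_{\mathrm{al}}$ would be compared. For a nonlinear model, the same first-order formula would hold only approximately.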