When deploying deep neural networks on robots or other physical systems, the learned model should reliably quantify predictive uncertainty. Reliable uncertainty estimates allow downstream modules to reason about the safety of the system's actions. In this work, we address metrics for evaluating such uncertainty estimates. Specifically, we focus on regression tasks and investigate Area Under Sparsification Error (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we examine how these metrics behave under four typical types of uncertainty, assess their stability with respect to test-set size, and reveal their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
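To make the four metrics concrete, the sketch below computes each of them for a heteroscedastic Gaussian regression model on synthetic data, mirroring the setting described above. This is a minimal sketch under stated assumptions, not the paper's reference implementation: it assumes the model outputs a predictive mean `mu` and standard deviation `sigma` per test point, uses an RMSE-based sparsification curve normalised by the full-set RMSE for AUSE, and evaluates calibration at 19 equally spaced confidence levels; the function names are illustrative.

```python
# Minimal sketch of the four uncertainty metrics for Gaussian regression.
# Assumes predictive mean `mu` and standard deviation `sigma` per point;
# normalisation and confidence-level choices are illustrative conventions.
import numpy as np
from scipy.stats import norm, spearmanr


def nll(y, mu, sigma):
    """Mean Gaussian negative log-likelihood of targets y."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (y - mu) ** 2 / (2 * sigma**2))


def ause(y, mu, sigma, steps=100):
    """Area Under the Sparsification Error curve (RMSE variant).

    Points are removed in order of decreasing predicted uncertainty;
    the oracle removes them in order of decreasing true error. AUSE is
    the area between the two RMSE curves (here normalised by full RMSE).
    """
    err = np.abs(y - mu)
    n = len(y)

    def curve(order):
        rmse = []
        for i in range(steps):
            keep = order[: n - int(i / steps * n)]  # drop a growing fraction
            rmse.append(np.sqrt(np.mean(err[keep] ** 2)))
        return np.array(rmse)

    spars = curve(np.argsort(sigma))   # ascending sigma: most uncertain removed first
    oracle = curve(np.argsort(err))    # ascending error: largest errors removed first
    return float(np.mean((spars - oracle) / spars[0]))


def calibration_error(y, mu, sigma, levels=np.linspace(0.05, 0.95, 19)):
    """Mean absolute gap between expected and observed interval coverage."""
    gaps = []
    for p in levels:
        half = norm.ppf(0.5 + p / 2) * sigma   # half-width of central p-interval
        observed = np.mean(np.abs(y - mu) <= half)
        gaps.append(abs(observed - p))
    return float(np.mean(gaps))


def spearman_uncertainty(y, mu, sigma):
    """Spearman rank correlation between predicted sigma and absolute error."""
    return spearmanr(sigma, np.abs(y - mu))[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma = rng.uniform(0.1, 1.0, size=2000)   # synthetic aleatoric noise levels
    mu = rng.normal(size=2000)
    y = mu + rng.normal(scale=sigma)           # targets drawn from the model itself
    print(f"NLL:        {nll(y, mu, sigma):.3f}")
    print(f"AUSE:       {ause(y, mu, sigma):.3f}")
    print(f"Cal. error: {calibration_error(y, mu, sigma):.3f}")
    print(f"Spearman:   {spearman_uncertainty(y, mu, sigma):.3f}")
```

Because the synthetic targets are drawn from the model's own predictive distribution, the calibration error should be near zero and the Spearman correlation positive, which makes this a convenient sanity check before applying the metrics to a real model.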