When deep neural networks are deployed on robots or other physical systems, the learned model should reliably quantify its predictive uncertainty. A reliable uncertainty estimate allows downstream modules to reason about the safety of the system's actions. In this work, we study metrics for evaluating such uncertainty estimates. Specifically, we focus on regression tasks and investigate Area Under Sparsification Error (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we examine how these metrics behave under four typical types of uncertainty, assess their stability with respect to the size of the test set, and reveal their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
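To make the four metrics named in the abstract concrete, the following is a minimal sketch of how each could be computed for a regression model that outputs a mean and a standard deviation per sample. It is not the paper's implementation, and its definitions are assumptions: the predictive distributions are taken to be Gaussian, Calibration Error is the mean absolute gap between expected and empirical coverage over a grid of confidence levels, Spearman's Rank Correlation is computed between absolute error and predicted standard deviation, and AUSE uses mean absolute error normalized by the error on the full test set. All function names (`gaussian_nll`, `calibration_error`, `spearman`, `ause`) are hypothetical, and the exact definitions used in the paper may differ.

```python
import math
import random
from statistics import NormalDist


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu_i, sigma_i^2)."""
    return sum(
        0.5 * math.log(2 * math.pi * s * s) + (t - m) ** 2 / (2 * s * s)
        for t, m, s in zip(y, mu, sigma)
    ) / len(y)


def calibration_error(y, mu, sigma, n_levels=20):
    """Mean |empirical - expected| coverage over a grid of confidence levels."""
    std_normal = NormalDist()
    # Predictive CDF evaluated at each target (assumes Gaussian predictions).
    pit = [std_normal.cdf((t - m) / s) for t, m, s in zip(y, mu, sigma)]
    levels = [(k + 1) / n_levels for k in range(n_levels)]
    gaps = [abs(sum(v <= p for v in pit) / len(pit) - p) for p in levels]
    return sum(gaps) / len(gaps)


def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (z - mb) for x, z in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((z - mb) ** 2 for z in rb)
    return cov / math.sqrt(var_a * var_b)


def _sparsification_curve(errors, sort_key, n_steps):
    """Mean error of the samples kept after dropping the largest sort_key first."""
    order = sorted(range(len(errors)), key=sort_key.__getitem__)
    sorted_err = [errors[i] for i in order]
    curve = []
    for k in range(n_steps):
        keep = len(errors) - (k * len(errors)) // n_steps  # always at least 1
        curve.append(sum(sorted_err[:keep]) / keep)
    return curve


def ause(abs_err, sigma, n_steps=20):
    """Area between uncertainty-sorted and oracle (error-sorted) curves."""
    by_unc = _sparsification_curve(abs_err, sigma, n_steps)
    oracle = _sparsification_curve(abs_err, abs_err, n_steps)
    # Normalized by the error on the full test set (an assumption of this sketch).
    return sum(u - o for u, o in zip(by_unc, oracle)) / (n_steps * by_unc[0])


# Synthetic heteroscedastic data whose predicted sigma matches the true noise.
rng = random.Random(0)
sigma = [rng.uniform(0.5, 2.0) for _ in range(2000)]
mu = [0.0] * len(sigma)
y = [rng.gauss(0.0, s) for s in sigma]
abs_err = [abs(t - m) for t, m in zip(y, mu)]

print("NLL:", gaussian_nll(y, mu, sigma))
print("Calibration Error:", calibration_error(y, mu, sigma))
print("Spearman:", spearman(abs_err, sigma))
print("AUSE:", ause(abs_err, sigma))
```

In this sketch, AUSE and Spearman's Rank Correlation depend only on how the predicted uncertainties rank the samples, whereas Calibration Error and NLL also depend on their scale. Multiplying every `sigma` by a constant therefore leaves the first two unchanged while changing the latter two.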