In human-AI collaboration systems for critical applications, users should set an operating point on model confidence that determines when a decision is delegated to human experts, so that errors are kept to a minimum. Samples for which the model confidence falls below the operating point are analysed manually by experts to avoid mistakes. Such systems become truly useful only if they satisfy two requirements: models should be confident only on samples for which they are accurate, and the number of samples delegated to experts should be minimized. The latter is especially crucial in applications where expert time is limited and expensive, such as healthcare. The trade-off between model accuracy and the number of samples delegated to experts can be represented by a curve similar to an ROC curve, which we refer to as the confidence operating characteristic (COC) curve. In this paper, we argue that deep neural networks should be trained by taking into account both accuracy and expert load and, to that end, propose a new complementary loss function for classification that maximizes the area under this COC curve. This simultaneously promotes an increase in network accuracy and a reduction in the number of samples delegated to humans. We perform experiments on multiple computer vision and medical image datasets for classification. Our results demonstrate that, compared to existing loss functions, the proposed loss improves classification accuracy, delegates fewer decisions to experts, achieves better out-of-distribution sample detection, and attains on-par calibration performance.
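The trade-off described in the abstract can be made concrete with a small evaluation sketch. The abstract does not specify the axes of the COC curve, so the code below assumes one plausible reading: the x-axis is the fraction of samples delegated to experts (the least confident ones), and the y-axis is the accuracy of the model on the samples it keeps. The function names `coc_curve` and `area_under_coc` are hypothetical, and this is only a post-hoc measurement of the curve, not the proposed training loss.

```python
import numpy as np


def coc_curve(confidence, correct):
    """Sketch of a COC-style curve (assumed axes, see lead-in).

    confidence: per-sample model confidence (higher = more confident).
    correct:    1 if the model prediction is right for that sample, else 0.
    Returns (delegated_fraction, retained_accuracy), both of length n,
    ordered by increasing delegated fraction.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidence.size

    # Sort from most to least confident: keeping k samples means keeping
    # the k most confident ones and delegating the remaining n - k.
    order = np.argsort(-confidence, kind="stable")
    hits = np.cumsum(correct[order])
    kept = np.arange(1, n + 1)

    accuracy = hits / kept          # accuracy on the kept samples
    delegated = 1.0 - kept / n      # fraction sent to experts

    # Reverse so the delegated fraction runs from 0 up to 1 - 1/n.
    # The point "everything delegated" is omitted: accuracy is undefined there.
    return delegated[::-1], accuracy[::-1]


def area_under_coc(delegated, accuracy):
    """Trapezoidal area under the curve returned by coc_curve."""
    delegated = np.asarray(delegated, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    widths = np.diff(delegated)
    heights = (accuracy[1:] + accuracy[:-1]) / 2.0
    return float(np.sum(widths * heights))


if __name__ == "__main__":
    # Toy data: four samples, the third one is misclassified.
    conf = [0.9, 0.8, 0.6, 0.55]
    hit = [1, 1, 0, 1]
    x, y = coc_curve(conf, hit)
    print("delegated fraction:", x)
    print("retained accuracy: ", y)
    print("area under curve:  ", area_under_coc(x, y))
```

Under these assumptions, a model that is confident only where it is accurate keeps a high retained accuracy even at small delegated fractions, which raises the area. Because the curve here stops at a delegated fraction of 1 - 1/n, the largest attainable area in this sketch is 1 - 1/n rather than 1, and ties in confidence are broken by input order, which a careful evaluation would handle explicitly.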