Identifying Label Errors in Object Detection Datasets by Loss Inspection

Labeling datasets for supervised object detection is a dull and time-consuming task. Errors are easily introduced during annotation and overlooked during review, which yields inaccurate benchmarks and degrades the performance of deep neural networks trained on noisy labels. In this work, we introduce, for the first time, a benchmark for label error detection methods on object detection datasets, together with a label error detection method and a number of baselines. We simulate four different types of randomly introduced label errors on the train and test sets of well-labeled object detection datasets. Our label error detection method assumes that a two-stage object detector is given and considers the sum of both stages' classification and regression losses. The losses are computed between the predictions and the noisy labels, which include the simulated label errors, with the aim of detecting the latter. We compare our method to three baselines: a naive one without deep learning, the object detector's score, and the entropy of the classification softmax distribution. Our method outperforms all baselines, and we demonstrate that, among the considered methods, it is the only one that detects label errors of all four types efficiently. Furthermore, we detect real label errors a) on commonly used test datasets in object detection and b) on a proprietary dataset. In both cases we achieve low false-positive rates: when considering 200 proposals from our method, we detect label errors with a precision of up to 71.5% for a) and 97% for b).
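The loss-based ranking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of cross-entropy and smooth L1 as the per-stage losses, the normalized (x1, y1, x2, y2) box format, and all names (`loss_score`, `rank_labels`, the `items` layout) are assumptions made for illustration only. The entropy function corresponds to one of the baselines mentioned in the abstract.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the (possibly noisy) class label."""
    return -math.log(max(probs[label], 1e-12))

def smooth_l1(pred_box, label_box, beta=1.0):
    """Smooth L1 distance between two boxes (assumed loss form)."""
    total = 0.0
    for p, t in zip(pred_box, label_box):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

def entropy(probs):
    """Shannon entropy of a softmax distribution (baseline score)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_score(stage1, stage2, label_class, label_box):
    """Sum of classification and regression losses over both stages."""
    return (
        cross_entropy(stage1["probs"], 1)            # stage 1: objectness, index 1 = object
        + smooth_l1(stage1["box"], label_box)        # stage 1: proposal box vs. label
        + cross_entropy(stage2["probs"], label_class)  # stage 2: class vs. label
        + smooth_l1(stage2["box"], label_box)        # stage 2: refined box vs. label
    )

def rank_labels(items):
    """Return label ids sorted by descending loss (most suspicious first)."""
    scored = [
        (loss_score(it["stage1"], it["stage2"], it["cls"], it["box"]), it["id"])
        for it in items
    ]
    return [label_id for _, label_id in sorted(scored, reverse=True)]

# Hypothetical toy data: identical predictions, but the second label has a
# flipped class, so its stage-2 classification loss is much larger.
box = (0.1, 0.1, 0.5, 0.5)
pred1 = {"probs": [0.05, 0.95], "box": box}
pred2 = {"probs": [0.9, 0.1], "box": box}
items = [
    {"id": "ok", "cls": 0, "box": box, "stage1": pred1, "stage2": pred2},
    {"id": "flipped", "cls": 1, "box": box, "stage1": pred1, "stage2": pred2},
]
ranking = rank_labels(items)
```

In this toy setup the label with the flipped class is ranked first. Reviewing only the top of such a ranking, for example the first 200 proposals, is what the reported precision values refer to.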
