The labor-intensive annotation of semantic segmentation datasets is prone to errors, since humans struggle to label every pixel correctly. We study algorithms to automatically detect such annotation errors, in particular methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled. This helps prioritize which data to review in order to ensure a high-quality training/evaluation dataset, which is critical in sensitive applications such as medical imaging and autonomous vehicles. Our label quality scores are widely applicable because they rely only on probabilistic predictions from a trained segmentation model; any model architecture and training procedure can be used. Here we study 7 different label quality scoring methods, used in conjunction with a DeepLabV3+ or an FPN segmentation model, to detect annotation errors in a version of the SYNTHIA dataset. Precision-recall evaluations reveal that one score, the soft-minimum of the model-estimated likelihoods of each pixel's annotated class, is particularly effective at identifying mislabeled images across multiple types of annotation error.
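To make the soft-minimum score concrete, the following is a minimal sketch of one way such a score could be computed for a single image. It assumes NumPy, a `(K, H, W)` array of per-pixel class probabilities, and an `(H, W)` integer label mask. The function name `softmin_label_quality` and the temperature value `0.1` are illustrative assumptions, not details stated in the abstract.

```python
import numpy as np

def softmin_label_quality(pred_probs, labels, temperature=0.1):
    """Soft-minimum of the model's probability for each pixel's annotated class.

    pred_probs: float array of shape (K, H, W), class probabilities per pixel.
    labels:     int array of shape (H, W), annotated class index per pixel.
    temperature: assumed smoothing parameter; smaller values approach a hard min.
    """
    # Probability the model assigns to the annotated class at every pixel.
    p = np.take_along_axis(pred_probs, labels[None, :, :], axis=0)[0].ravel()
    # Softmin weights: softmax of -p / T, shifted by the max for numerical stability.
    z = -p / temperature
    z -= z.max()
    w = np.exp(z)
    w /= w.sum()
    # Weighted average dominated by the least confident pixels.
    return float(np.dot(w, p))

# Toy 2-class, 2x2 image: the model is confident about class 0 everywhere
# except the bottom-right pixel, where it favors class 1.
probs_class0 = np.array([[0.9, 0.8], [0.9, 0.1]])
pred_probs = np.stack([probs_class0, 1.0 - probs_class0])
clean_labels = np.array([[0, 0], [0, 1]])   # agrees with the model
noisy_labels = np.array([[0, 0], [0, 0]])   # bottom-right pixel mislabeled

clean_score = softmin_label_quality(pred_probs, clean_labels)
noisy_score = softmin_label_quality(pred_probs, noisy_labels)
print(clean_score > noisy_score)
```

Sorting images by this score in ascending order would put the most suspicious annotations first for review. Because the weights concentrate on the lowest-probability pixels, a small mislabeled region can pull the image score down even when most pixels are annotated correctly.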