Most benchmark datasets for computer vision contain irrelevant images, near duplicates, and label errors. Consequently, model performance on these benchmarks may not be an accurate estimate of generalization capabilities. This concern is particularly acute in computer vision for medicine, where datasets are typically small, stakes are high, and annotation processes are expensive and error-prone. In this paper, we propose SelfClean, a general procedure for cleaning image datasets that exploits a latent space learned with self-supervision. By relying on self-supervised learning, our approach focuses on intrinsic properties of the data and avoids annotation biases. We formulate dataset cleaning as either a set of ranking problems, which significantly reduces human annotation effort, or a set of scoring problems, which enables fully automated decisions based on score distributions. We demonstrate that SelfClean achieves state-of-the-art performance in detecting irrelevant images, near duplicates, and label errors within popular computer vision benchmarks, retrieving both injected synthetic noise and natural contamination. In addition, we apply our method to multiple image datasets and confirm an improvement in evaluation reliability.
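To make the ranking formulation concrete, the following is a minimal sketch, assuming that image embeddings have already been produced by some self-supervised encoder. The function names and the specific scores (pairwise cosine distance for near duplicates, nearest-neighbor distance for irrelevant images, and a same-label versus other-label distance ratio for label errors) are illustrative assumptions, not necessarily the scores SelfClean uses.

```python
import numpy as np


def cosine_distances(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance matrix for row-wise embeddings."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - z @ z.T


def rank_near_duplicates(embeddings: np.ndarray):
    """Rank all image pairs, closest pairs (likely duplicates) first."""
    dist = cosine_distances(embeddings)
    i, j = np.triu_indices(len(dist), k=1)  # each unordered pair once
    order = np.argsort(dist[i, j], kind="stable")
    return [(int(i[k]), int(j[k]), float(dist[i[k], j[k]])) for k in order]


def rank_irrelevant(embeddings: np.ndarray):
    """Rank images by distance to their nearest neighbor, most isolated first.

    Assumption: isolation in latent space is used as a simple proxy
    for irrelevance.
    """
    dist = cosine_distances(embeddings)
    np.fill_diagonal(dist, np.inf)  # ignore self-distance
    nearest = dist.min(axis=1)
    order = np.argsort(-nearest, kind="stable")
    return [(int(k), float(nearest[k])) for k in order]


def rank_label_errors(embeddings: np.ndarray, labels):
    """Rank images by how much closer they sit to another class than to their own.

    Assumption: every class has at least two samples and there are at
    least two classes. Scores near 1 suggest a possible label error.
    """
    dist = cosine_distances(embeddings)
    np.fill_diagonal(dist, np.inf)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_other = np.where(~same, dist, np.inf).min(axis=1)
    score = d_same / (d_same + d_other + 1e-12)
    order = np.argsort(-score, kind="stable")
    return [(int(k), float(score[k])) for k in order]
```

In the ranking setting, a human annotator would inspect only the top of each list. In the scoring setting, a threshold derived from the score distribution would replace manual inspection; how that threshold is chosen is not specified here.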