High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset comprises 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure lets CytoCrowd serve two purposes. Using the gold-standard ground truth, it is a benchmark for standard computer vision tasks such as object detection and classification. Using the raw annotations, it is a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.
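To make the aggregation setting concrete, the following is a minimal sketch of the simplest possible aggregator, per-object majority voting over four annotators, scored against a gold-standard label. It is not the paper's baseline or data format: the object ids, the label names (`normal`, `abnormal`), the dictionary layout, and the toy values are all hypothetical, and the sketch assumes that objects have already been matched across annotators, which for detection-style annotations is itself a nontrivial step.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("at least one annotation is required")
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:
        if counts[label] == top:
            return label


def aggregate(annotations):
    """Map each object id to the majority label of its annotators."""
    return {obj_id: majority_vote(labels) for obj_id, labels in annotations.items()}


def accuracy(predicted, gold):
    """Fraction of gold-standard objects whose predicted label matches."""
    shared = [obj_id for obj_id in gold if obj_id in predicted]
    if not shared:
        return 0.0
    return sum(predicted[k] == gold[k] for k in shared) / len(shared)


# Hypothetical toy data: four annotator labels per object (not from CytoCrowd).
annotations = {
    "cell_1": ["normal", "normal", "abnormal", "normal"],
    "cell_2": ["abnormal", "normal", "abnormal", "abnormal"],
    "cell_3": ["normal", "abnormal", "abnormal", "normal"],  # 2-2 tie
}
# Hypothetical gold-standard labels from a single senior expert.
gold = {"cell_1": "normal", "cell_2": "abnormal", "cell_3": "abnormal"}

consensus = aggregate(annotations)
score = accuracy(consensus, gold)
print(consensus, score)
```

In this toy case the 2-2 split on `cell_3` is resolved arbitrarily by the tie-break rule and disagrees with the gold label, which illustrates why a separate gold standard is needed to judge an aggregator objectively.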