Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their reliability for biomedical applications. In this work, we introduce a new approach to improve model calibration by leveraging multi-rater annotations. We propose to train separate models on the annotations from single experts and aggregate their predictions to emulate consensus. This improves upon label sampling strategies, where models are trained on mixed annotations, and offers a more principled way to capture inter-rater variability. Experiments on a colorectal organoid dataset annotated by two experts demonstrate that our rater-specific ensemble strategy improves calibration performance while maintaining comparable detection accuracy. These findings suggest that explicitly modelling rater disagreement can lead to more trustworthy object detectors in biomedical imaging.
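As an illustration of the rater-specific ensemble idea, the following minimal sketch averages the confidence scores of models trained on different raters and scores the result with an expected calibration error (ECE). The abstract does not state the aggregation rule or the calibration metric, so both are assumptions here: simple score averaging, a standard equal-width-bin ECE, and detections that have already been matched across models. The function names `average_confidences` and `expected_calibration_error` are hypothetical.

```python
from typing import List, Sequence


def average_confidences(per_rater_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average confidence scores across rater-specific models.

    per_rater_scores[m][i] is the confidence of model m for candidate
    detection i. Assumption: detections are already matched across models,
    so index i refers to the same object in every model's output.
    """
    n_models = len(per_rater_scores)
    if n_models == 0:
        raise ValueError("at least one model is required")
    n_dets = len(per_rater_scores[0])
    if any(len(scores) != n_dets for scores in per_rater_scores):
        raise ValueError("all models must score the same matched detections")
    return [
        sum(scores[i] for scores in per_rater_scores) / n_models
        for i in range(n_dets)
    ]


def expected_calibration_error(
    confidences: Sequence[float],
    is_correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |mean confidence - accuracy|."""
    if len(confidences) != len(is_correct):
        raise ValueError("confidences and is_correct must have equal length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each detection to a confidence bin; 1.0 falls into the last bin.
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, is_correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

In this sketch, a lower ECE for the averaged scores than for a single model trained on mixed annotations would correspond to the calibration improvement the abstract describes. Whether a detection counts as correct (for example, by an overlap threshold against a reference annotation) is left to the caller.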