While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals, such as model agreement, uncertainty measures, or learned mask-quality scores, which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges, using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge's top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective that normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
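To make the region-normalization idea concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: it assumes per-pixel log-likelihoods for the preferred and dispreferred masks under the policy and a frozen reference model, restricts the DPO log-ratio margin to the pixels where the two masks disagree, and divides by the size of that disagreement region. All function and variable names here are hypothetical.

```python
import numpy as np

def rn_dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l,
                mask_w, mask_l, beta=0.1, eps=1e-8):
    """Illustrative region-normalized DPO loss for a single preference pair.

    logp_w / logp_l: per-pixel log-likelihoods (H, W) of the preferred and
    dispreferred masks under the policy; logp_ref_*: same under the frozen
    reference model. mask_w / mask_l: the binary masks themselves (H, W).
    """
    # Pixels where the preferred and dispreferred masks disagree.
    disagree = mask_w != mask_l
    n = disagree.sum()

    # Policy-vs-reference log-ratio margins, summed over the disagreement
    # region only (pixels where the masks agree carry no preference signal).
    margin_w = (logp_w - logp_ref_w)[disagree].sum()
    margin_l = (logp_l - logp_ref_l)[disagree].sum()

    # Normalizing by |disagreement region| caps the leverage any single
    # comparison (e.g. a grossly wrong proposal) can exert on the update.
    logits = beta * (margin_w - margin_l) / (n + eps)

    # Standard DPO outer loss: -log sigmoid of the scaled margin.
    return -np.log(1.0 / (1.0 + np.exp(-logits)))
```

Without the division by `n`, pairs that differ over large regions would dominate the gradient, which is exactly the failure mode under a noisy judge; the normalization makes each comparison contribute on a per-pixel scale regardless of how much the two masks differ.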