While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet it remains underused in practice because existing evidence rests largely on synthetic noise, simplified settings, and limited evaluation on real-world noise. We address this gap by introducing a benchmark suite that supports systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework that includes deployment-relevant client-noise scenarios and label-noise-targeted evaluation. It provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.