Noisy labels are inevitable in medical image segmentation and cause severe performance degradation. Previous segmentation methods for noisy labels use only a single image, overlooking the potential of the correlation between images. For video segmentation in particular, adjacent frames contain rich contextual information that helps in recognizing noisy labels. Based on two insights, we propose a Multi-Scale Temporal Feature Affinity Learning (MS-TFAL) framework for medical video segmentation with noisy labels. First, we argue that the sequential prior of videos is an effective reference, i.e., pixel-level features from adjacent frames are close in distance for the same class and far apart otherwise. We therefore devise Temporal Feature Affinity Learning (TFAL), which indicates possible noisy labels by evaluating the affinity between pixels in two adjacent frames. Second, we observe that the noise distribution varies considerably across the video, image, and pixel levels. Accordingly, we introduce Multi-Scale Supervision (MSS), which supervises the network from these three perspectives by re-weighting and refining the samples. This design enables the network to concentrate on clean samples in a coarse-to-fine manner. Experiments with both synthetic and real-world label noise demonstrate that our method outperforms recent state-of-the-art robust segmentation approaches. Code is available at https://github.com/BeileiCui/MS-TFAL.
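The affinity idea behind TFAL can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the affinity measure or the decision rule, so cosine similarity and the "mean same-class affinity below mean other-class affinity" test are assumptions made here, and the function names `cosine` and `flag_noisy_pixels` are hypothetical. Pixels are represented as flat lists of feature vectors to keep the example dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed affinity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def flag_noisy_pixels(feats_t, labels_t, feats_prev, labels_prev):
    """Return one boolean per pixel of frame t; True marks a suspicious label.

    Assumed rule: a pixel's label is suspicious when its mean affinity to
    previous-frame pixels with the same label is lower than its mean affinity
    to previous-frame pixels with a different label.
    """
    flags = []
    for feat, label in zip(feats_t, labels_t):
        same = [cosine(feat, g) for g, z in zip(feats_prev, labels_prev) if z == label]
        diff = [cosine(feat, g) for g, z in zip(feats_prev, labels_prev) if z != label]
        if not same or not diff:
            # No basis for comparison in the adjacent frame: leave unflagged.
            flags.append(False)
            continue
        flags.append(sum(same) / len(same) < sum(diff) / len(diff))
    return flags


# Toy data: two classes with well-separated 2-D features in the previous frame.
feats_prev = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels_prev = [0, 0, 1, 1]

# In the current frame, the third pixel looks like class 0 but is labeled 1.
feats_t = [[1.0, 0.05], [0.05, 1.0], [0.95, 0.0]]
labels_t = [0, 1, 1]

print(flag_noisy_pixels(feats_t, labels_t, feats_prev, labels_prev))
```

In this toy case only the third pixel is flagged, because its features lie close to the class-0 features of the adjacent frame while its label says class 1. A practical version would operate on dense feature maps from the segmentation backbone and would likely produce soft weights rather than hard flags.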