Recent developments in deep learning (DL) have substantially improved performance in medical image segmentation, especially with the Transformer model and its variants. Labels obtained by fusing multi-rater manual segmentations are often used as ideal ground truths for DL model training, yet inter-rater variability, arising from factors such as training bias, image noise, and extreme anatomical variability, can still affect the performance and uncertainty of the resulting algorithms. Understanding how inter-rater variability affects the reliability of these algorithms, a key element in clinical deployment, can inform better training data construction and DL model design, but has not been explored extensively. In this paper, we measure aleatoric and epistemic uncertainties using test-time augmentation (TTA), test-time dropout (TTD), and deep ensembles to explore their relationship with inter-rater variability. Furthermore, we compare UNet and TransUNet under two label fusion strategies to study the impact of Transformers on model uncertainty. We conduct a case study on multi-class paraspinal muscle segmentation from T2-weighted (T2w) MRIs. Our study reveals the interplay between inter-rater variability and model uncertainties, which is affected by the choice of label fusion strategy and DL model.
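As a rough illustration of how uncertainty might be summarized from repeated stochastic predictions (TTA passes, TTD passes, or deep ensemble members), the sketch below computes per-pixel uncertainty maps from stacked softmax outputs. The abstract does not state which uncertainty metric is used, so the entropy-based decomposition here is an assumption, and the function name `uncertainty_maps` and the `(T, C, H, W)` array layout are hypothetical.

```python
import numpy as np

def uncertainty_maps(probs, eps=1e-12):
    """Summarize uncertainty from T stochastic predictions.

    probs: array of shape (T, C, H, W) holding softmax outputs from
    T passes (e.g. augmented inputs, dropout samples, or ensemble members).
    This layout is an assumption made for illustration only.
    """
    probs = np.asarray(probs, dtype=np.float64)

    # Average class probabilities over the T passes -> (C, H, W)
    mean_p = probs.mean(axis=0)

    # Entropy of the averaged prediction (total predictive uncertainty) -> (H, W)
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=0)

    # Average entropy of the individual predictions -> (H, W)
    expected = -(probs * np.log(probs + eps)).sum(axis=1).mean(axis=0)

    # Mutual information: disagreement between the passes -> (H, W)
    mutual_info = total - expected
    return total, expected, mutual_info
```

With this decomposition, pixels where the passes disagree with one another receive high mutual information, while pixels where every pass is individually unsure receive high expected entropy. Such maps could then be compared against a per-pixel measure of rater disagreement, although how that comparison is made in the study is not specified in the abstract.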