Automatic and precise medical image segmentation (MIS) is of vital importance for clinical diagnosis and analysis. Current MIS methods mainly rely on convolutional neural networks (CNNs) or the self-attention mechanism (Transformers) for feature modeling. However, CNN-based methods suffer from inaccurate localization owing to limited global dependency modeling, while Transformer-based methods often produce coarse boundaries due to the lack of local emphasis. Although some CNN-Transformer hybrid methods are designed to synthesize the complementary local and global information for better performance, the combination of CNN and Transformer introduces numerous parameters and increases the computational cost. To this end, this paper proposes a CNN-Transformer rectified collaborative learning (CTRCL) framework to learn stronger CNN-based and Transformer-based models for MIS tasks via bi-directional knowledge transfer between them. Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy that introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels for accurate knowledge transfer in the logit space. We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space by granting their intermediate features a similar capability of category perception. Extensive experiments on three popular MIS benchmarks demonstrate that our CTRCL outperforms most state-of-the-art collaborative learning methods under different evaluation metrics.
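To make the rectified logit-wise idea concrete, the sketch below illustrates one simplified reading of it: soft labels are produced from a student's logits, pixels where the student's prediction disagrees with the ground truth are selected, and those wrong regions are overwritten with the one-hot ground truth before being used as distillation targets. This is a minimal illustration only; the function name `rectify_soft_labels` and the hard one-hot replacement are our assumptions, and the actual RLCL strategy performs a more adaptive selection and rectification than shown here.

```python
import numpy as np

def rectify_soft_labels(student_logits, ground_truth, num_classes):
    """Simplified sketch (not the authors' exact method): replace the
    soft labels of wrongly predicted pixels with the one-hot ground
    truth so the distillation target is correct everywhere."""
    # softmax over the trailing class axis -> student soft labels
    e = np.exp(student_logits - student_logits.max(axis=-1, keepdims=True))
    soft = e / e.sum(axis=-1, keepdims=True)
    # one-hot encoding of the ground-truth class indices
    one_hot = np.eye(num_classes)[ground_truth]
    # mask of pixels where the student's argmax disagrees with the GT
    wrong = soft.argmax(axis=-1) != ground_truth
    # rectify: keep correct soft labels, overwrite wrong regions
    return np.where(wrong[..., None], one_hot, soft)
```

For a pixel the student classifies correctly, its (informative) soft label is kept as the transfer target; for a misclassified pixel, the target falls back to the ground truth, preventing wrong knowledge from propagating to the peer model.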