Because high-quality annotations are scarce in the medical imaging community, semi-supervised learning methods are highly valued for image semantic segmentation. In this paper, we present a consistency-aware, pseudo-label-based self-ensembling approach that exploits both the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) in semi-supervised learning. The proposed framework consists of a feature-learning module, in which a ViT and a CNN enhance each other, and a guidance module that provides robust consistency-aware supervision. In the feature-learning module, pseudo labels are inferred and used recurrently and separately from the CNN view and the ViT view; this expands the training set and lets each view benefit the other. Meanwhile, a perturbation scheme is designed for the feature-learning module, and network weight averaging is used to build the guidance module. In this way, the framework combines the feature-learning strengths of CNN and ViT, improves performance through dual-view co-training, and enables consistency-aware supervision in a semi-supervised manner. A topological exploration of all alternative supervision modes with CNN and ViT is validated in detail, demonstrating that the specific setting adopted by our method gives the most promising performance on semi-supervised medical image segmentation tasks. Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark data set across a variety of metrics. The code is publicly available.
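The two mechanisms named in the abstract, pseudo labels exchanged between the CNN and ViT views and a guidance module built by averaging network weights, can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption: hard (argmax) pseudo labels, a cross-entropy loss, an exponential moving average as the form of weight averaging, and the names `pseudo_labels`, `cross_entropy`, `ema_update`, and `alpha` are all hypothetical. Plain Python lists stand in for network outputs and weights.

```python
import math
from typing import Dict, List


def pseudo_labels(probs: List[List[float]]) -> List[int]:
    """Hard pseudo label per pixel: index of the most probable class (assumed form)."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def cross_entropy(probs: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood of the given labels under the predicted probabilities."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               alpha: float = 0.99) -> Dict[str, float]:
    """Move the averaged (guidance) weights toward the student weights.

    Assumption: weight averaging is an exponential moving average with decay alpha.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


# Toy softmax outputs for 3 pixels and 2 classes from each view (made-up numbers).
cnn_probs = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
vit_probs = [[0.7, 0.3], [0.4, 0.6], [0.45, 0.55]]

# Each view infers pseudo labels on unlabeled pixels ...
cnn_pl = pseudo_labels(cnn_probs)
vit_pl = pseudo_labels(vit_probs)

# ... and is supervised by the other view's pseudo labels (dual-view co-training).
loss_vit = cross_entropy(vit_probs, cnn_pl)
loss_cnn = cross_entropy(cnn_probs, vit_pl)

# Guidance weights follow a student network by weight averaging (single toy weight).
guidance = ema_update({"w": 1.0}, {"w": 0.0}, alpha=0.9)
```

In a real training loop the lists would be tensors produced by the two networks on perturbed inputs, and which network the averaged weights track is a design choice that the abstract leaves open.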