Deep learning has been widely used in medical image segmentation and related tasks. However, the performance of existing medical image segmentation models is limited by the difficulty of obtaining sufficient high-quality labeled data, owing to the prohibitive cost of data annotation. To alleviate this limitation, we propose LViT (Language meets Vision Transformer), a new text-augmented medical image segmentation model. In LViT, medical text annotations are incorporated to compensate for the quality deficiency in image data. In addition, the text information guides the generation of higher-quality pseudo labels in semi-supervised learning. We also propose an Exponential Pseudo label Iteration mechanism (EPI) to help the Pixel-Level Attention Module (PLAM) preserve local image features in the semi-supervised LViT setting. An LV (Language-Vision) loss is designed to supervise the training of unlabeled images directly using text information. For evaluation, we construct three multimodal medical segmentation datasets (image + text) containing X-ray and CT images. Experimental results show that the proposed LViT achieves superior segmentation performance in both fully supervised and semi-supervised settings. The code and datasets are available at https://github.com/HUANGLIZI/LViT.
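The abstract names the Exponential Pseudo label Iteration mechanism (EPI) but does not state its update rule. As a minimal sketch, assuming that EPI behaves like an exponential moving average that blends the previous pseudo label with the model's newest prediction, the idea might look as follows. The function name `epi_update`, the momentum parameter `beta`, and the toy arrays are hypothetical and are not taken from the LViT code base.

```python
import numpy as np


def epi_update(prev_pseudo, new_pred, beta=0.99):
    """Blend the previous pseudo label with a new prediction.

    Assumption: an exponential-moving-average rule,
        updated = beta * prev_pseudo + (1 - beta) * new_pred,
    which may differ from the rule actually used in LViT.
    """
    prev_pseudo = np.asarray(prev_pseudo, dtype=float)
    new_pred = np.asarray(new_pred, dtype=float)
    if prev_pseudo.shape != new_pred.shape:
        raise ValueError("pseudo label and prediction must have the same shape")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * prev_pseudo + (1.0 - beta) * new_pred


# Toy 2x2 foreground-probability maps (hypothetical values).
pseudo = np.array([[0.0, 1.0], [0.5, 0.5]])
pred = np.array([[1.0, 1.0], [0.5, 0.0]])

# With a large beta, the pseudo label changes slowly between iterations.
updated = epi_update(pseudo, pred, beta=0.9)
```

Under this assumption, a larger `beta` makes the pseudo labels change more gradually, so a single noisy prediction on an unlabeled image has limited influence on the label used for supervision.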