Weakly-supervised medical image segmentation is a challenging task that aims to reduce annotation cost while maintaining segmentation performance. In this paper, we present SimTxtSeg, a novel framework that leverages simple text cues to generate high-quality pseudo-labels and, at the same time, studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.