Vision-language models (VLMs) have great potential to improve the quality of pseudo labels in semi-supervised spine segmentation, because they can use textual class prompts to generate segmentation maps, yet this direction remains unexplored. Although promising, existing VLMs lack explicit constraints that enforce consistency between spine class prompts and spine unit regions, which leads to unsatisfactory multi-class segmentation maps. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network that uses class prompts to improve the quality of spine pseudo labels. CPS4 is trained in two stages. (i) Class-specific consistency-constrained VLM pretraining: we propose token-level and pixel-level attention losses that optimize the consistency between class prompts and spine units, forcing each textual class prompt to be closely coupled with its target spine unit in the semantic space. (ii) Class-prompt-driven semi-supervised spine segmentation: using the pretrained vision-text encoder, we derive a class-specific binary segmentation map for each class in an unlabeled spine image and integrate these maps into a unified multi-class segmentation map, which improves the quality of the pseudo labels generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves a Dice score of 80.44% on a public spine segmentation dataset using only 5% labeled data, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.
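To make the integration step in stage (ii) concrete, the following is a minimal sketch of one way class-specific binary maps could be merged into a single multi-class pseudo label. The abstract does not state the fusion rule, so the argmax-with-threshold rule used here is an assumption rather than the CPS4 method; the function name `merge_binary_maps`, the `threshold` parameter, and the toy probabilities are hypothetical.

```python
import numpy as np


def merge_binary_maps(class_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Fuse per-class foreground probabilities into one label map.

    Assumed fusion rule (not taken from the paper): each pixel takes the
    class whose prompt gives the highest probability, and falls back to
    background when no class reaches `threshold`.

    class_probs: array of shape (C, H, W), one map per class prompt.
    Returns an (H, W) integer map: 0 = background, k = class k (1-indexed).
    """
    if class_probs.ndim != 3:
        raise ValueError("class_probs must have shape (C, H, W)")
    best_class = class_probs.argmax(axis=0)  # index of the strongest class
    best_prob = class_probs.max(axis=0)      # its probability
    labels = np.where(best_prob >= threshold, best_class + 1, 0)
    return labels.astype(np.int64)


# Toy example: two class prompts on a 2x2 image (made-up values).
probs = np.array([
    [[0.9, 0.2], [0.6, 0.1]],  # class 1
    [[0.3, 0.8], [0.7, 0.4]],  # class 2
])
pseudo_label = merge_binary_maps(probs)
```

Resolving overlaps by argmax guarantees that each pixel receives exactly one label, which independent per-class thresholding does not; other fusion rules are equally plausible.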