Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments

Medical image segmentation is a fundamental task in numerous medical engineering applications. Recently, language-guided segmentation has shown promise in medical scenarios where textual clinical reports are readily available as semantic guidance. These reports contain diagnostic information from clinicians, which supplies auxiliary textual semantics to guide segmentation. However, existing language-guided segmentation methods neglect the inherent pattern gaps between the image and text modalities, resulting in sub-optimal vision-language integration. Contrastive learning is a well-recognized approach to aligning image-text patterns, but it has not been optimized for bridging the pattern gaps in medical language-guided segmentation, which relies primarily on medical image details to characterize the underlying diseases/targets. Current contrastive alignment techniques typically align high-level global semantics without involving low-level localized target information, and thus cannot deliver fine-grained textual guidance on crucial image details. In this study, we propose a Target-informed Multi-level Contrastive Alignment framework (TMCA) to bridge image-text pattern gaps for medical language-guided segmentation. TMCA enables target-informed image-text alignment and fine-grained textual guidance by introducing: (i) a target-sensitive semantic distance module that utilizes target information for more granular image-text alignment modeling, (ii) a multi-level contrastive alignment strategy that directs fine-grained textual guidance to multi-scale image details, and (iii) a language-guided target enhancement module that reinforces attention to critical image regions based on the aligned image-text patterns. Extensive experiments on four public benchmark datasets demonstrate that TMCA outperforms state-of-the-art language-guided medical image segmentation methods.

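To make the idea of multi-level contrastive alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss summed over several image feature levels. It is not the TMCA implementation: the abstract does not specify the target-sensitive semantic distance, so plain cosine similarity is swapped in here. The function names (`cosine`, `info_nce`, `multi_level_loss`), the temperature of 0.07, the per-level weights, and the assumption that each level has already been pooled and projected to the text embedding size are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch; img[i] and txt[i] form a matched pair."""
    n = len(img)
    logits = [[cosine(img[i], txt[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the matched pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def multi_level_loss(img_levels, txt, weights=None):
    """Weighted sum of InfoNCE losses, one per image feature level.

    img_levels: list of batches, one per scale (hypothetical layout),
    each already pooled and projected to the text embedding size.
    """
    weights = weights or [1.0] * len(img_levels)
    return sum(w * info_nce(level, txt) for w, level in zip(weights, img_levels))


# Toy batch of two report embeddings and two image feature levels.
txt = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[0.9, 0.1], [0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]]]
shuffled = [[level[1], level[0]] for level in aligned]
print(multi_level_loss(aligned, txt) < multi_level_loss(shuffled, txt))
```

In this toy batch, the correctly paired features yield a lower loss than the same features with the pairing swapped, which is the behavior a contrastive alignment objective is meant to reward at every level.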