Deep learning relies heavily on data augmentation to mitigate the scarcity of training data, especially in medical imaging. Recent multimodal approaches integrate text and images for segmentation, a task known as referring or text-guided image segmentation. However, common augmentations such as rotation and flipping disrupt the spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into the visual space, bridging the semantic gap between modalities. Visualizations of the generated pseudo-images show that they localize target regions accurately. We evaluate our method on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
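The sketch below illustrates the early-fusion idea described above, under stated assumptions rather than the authors' actual implementation: a hypothetical lightweight generator (`TextToPseudoImage`) projects a text embedding into a single-channel pseudo-image, which is concatenated with the input image before spatial augmentation, so rotations and flips transform the text-derived features and the image together. All module names, dimensions, and the choice of augmentations are illustrative.

```python
# Minimal sketch (not the paper's code) of early fusion before augmentation:
# project a text embedding into a pseudo-image, concatenate it with the image,
# then apply the same spatial augmentation to the fused tensor and the mask
# so text-derived features stay spatially aligned with the image.
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF


class TextToPseudoImage(nn.Module):
    """Lightweight generator: text embedding -> single-channel spatial map.
    Dimensions (768-d text, 14x14 grid, 224x224 output) are assumptions."""

    def __init__(self, text_dim: int = 768, out_size: int = 224):
        super().__init__()
        self.out_size = out_size
        # Project to a coarse 14x14 grid, then upsample to image resolution.
        self.proj = nn.Sequential(
            nn.Linear(text_dim, 14 * 14),
            nn.ReLU(inplace=True),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        b = text_emb.shape[0]
        grid = self.proj(text_emb).view(b, 1, 14, 14)
        return nn.functional.interpolate(
            grid, size=(self.out_size, self.out_size),
            mode="bilinear", align_corners=False,
        )


def fuse_then_augment(image: torch.Tensor, pseudo: torch.Tensor,
                      mask: torch.Tensor, angle: float, hflip: bool):
    """Concatenate image and pseudo-image, then rotate/flip the fused tensor
    and the segmentation mask identically."""
    fused = torch.cat([image, pseudo], dim=1)   # (B, C+1, H, W)
    fused = TF.rotate(fused, angle)
    mask = TF.rotate(mask, angle)
    if hflip:
        fused = TF.hflip(fused)
        mask = TF.hflip(mask)
    return fused, mask


if __name__ == "__main__":
    gen = TextToPseudoImage(text_dim=768, out_size=224)
    img = torch.rand(2, 1, 224, 224)                      # e.g. grayscale scans
    txt = torch.rand(2, 768)                              # sentence embeddings
    seg = torch.randint(0, 2, (2, 1, 224, 224)).float()   # segmentation masks
    fused, seg_aug = fuse_then_augment(img, gen(txt), seg, angle=15.0, hflip=True)
    print(fused.shape)                                    # torch.Size([2, 2, 224, 224])
```

Because the pseudo-image is fused before any geometric transform, the downstream segmentation network always sees text cues that undergo the same rotation and flipping as the pixels they describe.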