We introduce a novel uncertainty-aware multimodal segmentation framework that leverages both radiological images and associated clinical text for precise medical diagnosis. We propose a Modality Decoding Attention Block (MoDAB) with a lightweight State Space Mixer (SSMix) to enable efficient cross-modal fusion and long-range dependency modelling. To guide learning under ambiguity, we further introduce the Spectral-Entropic Uncertainty (SEU) Loss, which jointly captures spatial overlap, spectral consistency, and predictive uncertainty in a unified objective. This formulation improves model reliability in complex clinical settings with poor image quality. Extensive experiments on three publicly available medical datasets (QATA-COVID19, MosMed++, and Kvasir-SEG) demonstrate that our method achieves superior segmentation performance while being significantly more computationally efficient than existing state-of-the-art (SoTA) approaches. Our results highlight the importance of incorporating uncertainty modelling and structured modality alignment in vision-language medical segmentation tasks. Code: https://github.com/arya-domain/UA-VLS
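To make the fusion design concrete, the following is a minimal PyTorch sketch of a MoDAB-style block: image tokens cross-attend to clinical-text tokens, and a diagonal linear recurrence stands in for the SSMix state space mixer. The class name `MoDABSketch`, the dimensions, and the recurrence form are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MoDABSketch(nn.Module):
    """Hypothetical sketch of a MoDAB-style cross-modal fusion block.

    Image tokens attend to text tokens via cross-attention, then a
    lightweight diagonal linear recurrence (a stand-in for the SSMix
    state space mixer) propagates long-range context along the token
    sequence. All shapes and components here are assumptions.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # learnable per-channel decay for the diagonal recurrence
        self.log_decay = nn.Parameter(torch.zeros(dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_tokens, txt_tokens):
        # cross-modal fusion: image tokens query the text tokens
        fused, _ = self.cross_attn(img_tokens, txt_tokens, txt_tokens)
        x = self.norm(img_tokens + fused)

        # lightweight sequential mixer: h_t = a * h_{t-1} + x_t
        a = torch.sigmoid(self.log_decay)          # decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            h = a * h + x[:, t]
            outs.append(h)
        return x + self.proj(torch.stack(outs, dim=1))
```

A diagonal recurrence like this runs in linear time in the sequence length, which is one plausible reading of the "lightweight" and "long-range dependency" claims; the released code should be consulted for the actual mixer.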
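Similarly, the SEU Loss is described as combining spatial overlap, spectral consistency, and predictive uncertainty. Below is a hedged sketch of one way to realise such an objective: a soft Dice term for overlap, an FFT-magnitude term for spectral agreement, and a binary-entropy penalty for uncertainty. The function name `seu_loss` and the weights are placeholders; the paper's exact formulation may differ.

```python
import torch

def seu_loss(logits, target, w_dice=1.0, w_spec=0.5, w_ent=0.1, eps=1e-6):
    """Hypothetical SEU-style objective for binary segmentation.

    logits, target: tensors of shape (B, 1, H, W); target in {0, 1}.
    Combines (i) soft Dice, (ii) an L1 gap between FFT magnitudes,
    and (iii) a predictive-entropy penalty. Weights are assumptions.
    """
    prob = torch.sigmoid(logits)

    # (i) spatial overlap: soft Dice loss
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    # (ii) spectral consistency: magnitude gap in the frequency domain
    spec = (torch.fft.rfft2(prob).abs()
            - torch.fft.rfft2(target).abs()).abs().mean(dim=(1, 2, 3))

    # (iii) predictive uncertainty: per-pixel binary entropy
    ent = -(prob * (prob + eps).log()
            + (1 - prob) * (1 - prob + eps).log()).mean(dim=(1, 2, 3))

    return (w_dice * dice + w_spec * spec + w_ent * ent).mean()
```

Under this reading, the entropy term discourages diffuse, low-confidence predictions while the spectral term penalises structural mismatches that per-pixel losses can miss, which is consistent with the reliability claim for low-quality images.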