The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
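To make the LAD mechanism concrete, here is a minimal numpy sketch of how "contrastively dissimilar to style descriptions" could be scored, assuming the objective reduces to penalizing cosine similarity between segmentation features and text embeddings of confounding styles. The function name, the hinge form, and the margin parameter are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def lad_style_loss(seg_feats, style_text_embeds, margin=0.0):
    """Hypothetical sketch of a Linguistic Adversarial Disentanglement
    penalty: the higher the cosine similarity between segmentation
    features and style-description embeddings, the larger the loss,
    pushing the features to discard style (non-causal) information.

    seg_feats:          (N, D) array of segmentation feature vectors
    style_text_embeds:  (M, D) array of style-description text embeddings
    """
    # L2-normalize rows so the dot product equals cosine similarity
    f = seg_feats / np.linalg.norm(seg_feats, axis=1, keepdims=True)
    s = style_text_embeds / np.linalg.norm(style_text_embeds, axis=1, keepdims=True)
    sim = f @ s.T  # (N, M) cosine similarities, feature vs. style text
    # Hinge: only similarity above the margin is penalized (assumed form)
    return float(np.maximum(sim - margin, 0.0).mean())

# Features orthogonal to the style embedding incur no penalty;
# features aligned with it are penalized.
orthogonal = lad_style_loss(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
aligned = lad_style_loss(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
```

Minimizing such a term during training drives the encoder's features away from whatever the Vision-Language Model describes as style, leaving anatomy-related (causal) content to carry the segmentation signal.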