Cross-Modal Causal Intervention for Medical Report Generation

Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance: by automatically generating the report that corresponds to a given radiology image, it can relieve the heavy workload of radiologists. However, visual and linguistic biases induce spurious correlations in image-text data, which makes it difficult to generate accurate reports that reliably describe lesion areas. Moreover, cross-modal confounders are usually unobservable and difficult to eliminate explicitly. In this paper, we mitigate cross-modal data bias for MRG from a new perspective, cross-modal causal intervention, and propose a novel Visual-Linguistic Causal Intervention (VLCI) framework. VLCI consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM), which implicitly mitigate the visual-linguistic confounders through causal front-door intervention. Specifically, because no generalized semantic extractor is available, the VDM explores and disentangles the visual confounders from patch-based local and global features without expensive fine-grained annotations. Likewise, because no knowledge base covers the entire field of medicine, the LDM eliminates the linguistic confounders caused by salient visual features and high-frequency context without constructing a terminology database. Extensive experiments on the IU-Xray and MIMIC-CXR datasets show that VLCI significantly outperforms state-of-the-art MRG methods. The code and models are available at https://github.com/WissingChen/VLCI.
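
For readers unfamiliar with front-door intervention, the sketch below illustrates the textbook front-door adjustment, P(y | do(x)) = sum_m P(m | x) sum_x' P(y | m, x') P(x'), on a toy binary model. It is not the VLCI implementation, which approximates the intervention implicitly with neural modules. All variable names and probability values here are hypothetical and chosen only for illustration: U is an unobserved confounder, X the input, M a mediator, and Y the outcome.

```python
from itertools import product

# Hypothetical toy parameters; every variable is binary.
P_U = {0: 0.6, 1: 0.4}                  # P(U=u), U is never observed
P_X_GIVEN_U = {0: 0.2, 1: 0.8}          # P(X=1 | U=u)
P_M_GIVEN_X = {0: 0.1, 1: 0.9}          # P(M=1 | X=x), M depends on X only
P_Y_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5,
                (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | M=m, U=u)


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observed_joint():
    """Joint P(x, m, y) with the confounder U marginalized out."""
    joint = {}
    for x, m, y in product((0, 1), repeat=3):
        joint[(x, m, y)] = sum(
            P_U[u]
            * bern(P_X_GIVEN_U[u], x)
            * bern(P_M_GIVEN_X[x], m)
            * bern(P_Y_GIVEN_MU[(m, u)], y)
            for u in (0, 1)
        )
    return joint


def naive_conditional(joint, x, y):
    """P(y | x) from observed data; biased by the hidden confounder."""
    p_x = sum(joint[(x, m, b)] for m in (0, 1) for b in (0, 1))
    return sum(joint[(x, m, y)] for m in (0, 1)) / p_x


def front_door(joint, x, y):
    """Front-door estimate of P(y | do(x)) using observed variables only."""
    p_x = {a: sum(joint[(a, m, b)] for m in (0, 1) for b in (0, 1))
           for a in (0, 1)}
    total = 0.0
    for m in (0, 1):
        p_m_given_x = sum(joint[(x, m, b)] for b in (0, 1)) / p_x[x]
        inner = 0.0
        for x2 in (0, 1):
            p_x2_m = sum(joint[(x2, m, b)] for b in (0, 1))
            inner += joint[(x2, m, y)] / p_x2_m * p_x[x2]  # P(y | m, x') P(x')
        total += p_m_given_x * inner
    return total


def true_do(x, y):
    """Ground-truth P(y | do(x)), computable only because the toy U is known."""
    return sum(
        P_U[u] * bern(P_M_GIVEN_X[x], m) * bern(P_Y_GIVEN_MU[(m, u)], y)
        for u in (0, 1)
        for m in (0, 1)
    )


if __name__ == "__main__":
    joint = observed_joint()
    print("naive P(Y=1 | X=1):     ", naive_conditional(joint, 1, 1))
    print("front-door P(Y=1|do(1)):", front_door(joint, 1, 1))
    print("true P(Y=1 | do(X=1)):  ", true_do(1, 1))
```

In this toy setting the front-door estimate recovers the true interventional probability without ever observing U, while the naive conditional does not. This is the property that makes front-door intervention attractive when confounders are unobservable, as the abstract notes for MRG.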

