Cross-Modal Causal Intervention for Medical Report Generation

Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance: by automatically generating a report for a given radiology image, it can relieve radiologists of a heavy workload. However, visual and linguistic biases induce spurious correlations in image-text data, which makes it difficult to generate accurate reports that reliably describe lesion areas. Moreover, the cross-modal confounders are usually unobservable and difficult to eliminate explicitly. In this paper, we approach the cross-modal data bias in MRG from a new perspective, cross-modal causal intervention, and propose a novel Visual-Linguistic Causal Intervention (VLCI) framework. VLCI consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM), which implicitly mitigate the visual-linguistic confounders through causal front-door intervention. Specifically, because no generalized semantic extractor is available, the VDM explores and disentangles the visual confounders from patch-based local and global features without expensive fine-grained annotations. Meanwhile, because no knowledge base covers the entire field of medicine, the LDM eliminates the linguistic confounders caused by salient visual features and high-frequency context without constructing a terminology database. Extensive experiments on the IU-Xray and MIMIC-CXR datasets show that VLCI significantly outperforms state-of-the-art MRG methods. The code and models are available at https://github.com/WissingChen/VLCI.
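
For readers unfamiliar with the front-door intervention mentioned above, the following toy sketch may help. It is not the VLCI implementation, which approximates the intervention implicitly with learned modules; it only illustrates the standard front-door adjustment on a small discrete model. The variables (a hidden confounder U, an input X, a mediator M, an output Y) and all probability values are hypothetical, chosen purely for illustration.

```python
from itertools import product

# Hypothetical toy model (all numbers are illustrative assumptions):
# U -> X and U -> Y (hidden confounder), X -> M -> Y (mediator path).
P_U = {0: 0.6, 1: 0.4}                    # P(U=u)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}           # P(X=1 | U=u)
P_M1_GIVEN_X = {0: 0.1, 1: 0.7}           # P(M=1 | X=x)
P_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | M=m, U=u)


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


# Observational joint P(x, m, y): U is marginalised out, as if unobserved.
joint = {}
for x, m, y in product((0, 1), repeat=3):
    joint[(x, m, y)] = sum(
        P_U[u]
        * bern(P_X1_GIVEN_U[u], x)
        * bern(P_M1_GIVEN_X[x], m)
        * bern(P_Y1_GIVEN_MU[(m, u)], y)
        for u in (0, 1)
    )


def p_x(x):
    return sum(joint[(x, m, y)] for m in (0, 1) for y in (0, 1))


def p_m_given_x(m, x):
    return sum(joint[(x, m, y)] for y in (0, 1)) / p_x(x)


def p_y1_given_mx(m, x):
    return joint[(x, m, 1)] / (joint[(x, m, 0)] + joint[(x, m, 1)])


def naive(x):
    """Plain conditional P(Y=1 | X=x), biased by the hidden confounder."""
    return sum(joint[(x, m, 1)] for m in (0, 1)) / p_x(x)


def front_door(x):
    """P(Y=1 | do(X=x)) = sum_m P(m|x) * sum_x' P(Y=1|m,x') * P(x')."""
    return sum(
        p_m_given_x(m, x)
        * sum(p_y1_given_mx(m, xp) * p_x(xp) for xp in (0, 1))
        for m in (0, 1)
    )


def ground_truth(x):
    """Interventional value computed with access to U (unavailable in practice)."""
    return sum(
        bern(P_M1_GIVEN_X[x], m)
        * sum(P_Y1_GIVEN_MU[(m, u)] * P_U[u] for u in (0, 1))
        for m in (0, 1)
    )


if __name__ == "__main__":
    for x in (0, 1):
        print(x, naive(x), front_door(x), ground_truth(x))
```

In this toy setting the front-door estimate, which uses only the observable variables X, M and Y, agrees with the interventional value computed from the hidden confounder, while the plain conditional does not. That is the property that makes front-door adjustment attractive when confounders cannot be observed.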

