Dual-modal Dynamic Traceback Learning for Medical Report Generation

With increasing reliance on medical imaging in clinical practice, automated report generation from medical images is in great demand. Existing report generation methods typically adopt an encoder-decoder deep learning framework to build a uni-directional image-to-report mapping. However, such a framework ignores the bi-directional mutual associations between images and reports, which makes it difficult to associate the intrinsic medical meanings between them. Recent generative representation learning methods have demonstrated the benefits of dual-modal learning from both image and text modalities. However, these methods exhibit two major drawbacks for medical report generation: 1) they tend to capture morphological information and have difficulty capturing subtle pathological semantic information, and 2) they predict masked text by relying on both unmasked images and unmasked text, which inevitably degrades performance when inference is based solely on images. In this study, we propose a new report generation framework with dual-modal dynamic traceback learning (DTrace) to overcome these two drawbacks and enable dual-modal learning for medical report generation. To achieve this, DTrace introduces a traceback mechanism that controls the semantic validity of generated content via self-assessment. Further, DTrace introduces a dynamic learning strategy that adapts to various proportions of image and text input, enabling report generation without reliance on textual input during inference. Extensive experiments on two well-benchmarked datasets (IU-Xray and MIMIC-CXR) show that DTrace outperforms state-of-the-art medical report generation methods.
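
To make the idea of training on "various proportions" of text input concrete, the following is a minimal Python sketch, not the DTrace implementation. It assumes that the text proportion is controlled by a per-step masking ratio drawn from a fixed range, and that inference corresponds to masking all report tokens. The names `MASK`, `sample_text_mask_ratio`, and `mask_tokens`, as well as the range 0.5 to 1.0, are hypothetical choices made for illustration; the abstract does not specify them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def sample_text_mask_ratio(rng, low=0.5, high=1.0):
    """Draw a per-step text masking ratio; the range is an assumption."""
    return rng.uniform(low, high)


def mask_tokens(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) randomly chosen tokens with MASK."""
    n_mask = round(ratio * len(tokens))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_idx else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)
report = "the lungs are clear without focal consolidation".split()

# Training step: part of the report stays visible next to the image.
train_input = mask_tokens(report, sample_text_mask_ratio(rng), rng)

# Inference: every report token is masked, so only the image carries information.
infer_input = mask_tokens(report, 1.0, rng)
```

Under these assumptions, a model that has seen ratios up to 1.0 during training has already been exposed to the image-only condition it faces at inference, which is the mismatch the second drawback describes.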
