Automatic radiology report generation has attracted considerable research interest because of its practical value in reducing radiologists' workload. However, it remains challenging to establish, at the same time, global correspondences between an image (e.g., a chest X-ray) and its report and local alignments between image patches and keywords. To this end, we propose the Unify, Align and then Refine (UAR) approach, which learns multi-level cross-modal alignments through three novel modules: the Latent Space Unifier (LSU), the Cross-modal Representation Aligner (CRA), and the Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, so that a shared network can flexibly learn knowledge common across modalities. The modality-agnostic CRA first learns discriminative features via a set of orthonormal basis vectors and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR strengthens token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure that lets UAR grasp cross-modal alignments at different levels step by step, imitating radiologists' workflow: first writing sentence by sentence, then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of UAR over a variety of state-of-the-art methods.
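The global alignment step, in which CRA aligns visual and textual representations under a triplet contrastive loss, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below is an assumption: a generic bidirectional hinge-based triplet loss over cosine similarities that uses the hardest in-batch negative. The function names, the `margin` value, and the choice of hardest-negative mining are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def triplet_contrastive_loss(img, txt, margin=0.2):
    """Bidirectional triplet loss with hardest in-batch negatives.

    img, txt: arrays of shape (batch, dim); row i of img and row i of txt
    are assumed to form a matched image-report pair.
    NOTE: illustrative formulation only, not the paper's exact loss.
    """
    img = l2_normalize(np.asarray(img, dtype=float))
    txt = l2_normalize(np.asarray(txt, dtype=float))

    sim = img @ txt.T            # (batch, batch) cosine similarities
    pos = np.diag(sim).copy()    # similarity of each matched pair

    neg = sim.copy()
    np.fill_diagonal(neg, -np.inf)   # exclude matched pairs from negatives
    hardest_txt = neg.max(axis=1)    # hardest wrong report for each image
    hardest_img = neg.max(axis=0)    # hardest wrong image for each report

    # Hinge: a matched pair should beat its hardest negative by `margin`.
    loss_i2t = np.maximum(0.0, margin + hardest_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_img - pos)
    return float(np.mean(loss_i2t + loss_t2i))
```

In this sketch the loss is zero once every matched pair is more similar than its hardest mismatched pair by at least the margin, which is one common way to express a sentence-level (global) alignment objective.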