Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc and retraining-based methods. We introduce two strategies for multimodal attribution: a raw-image mode, which uses image patch attentions directly, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves performance competitive with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
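As a rough illustration of the attention-guided attribution idea (a minimal sketch, not the paper's exact procedure), the snippet below aggregates decoder cross-attention over candidate source spans and cites, for each generated sentence, the span receiving the most attention mass. All names, tensor shapes, and the mean/sum aggregation scheme are illustrative assumptions.

```python
import numpy as np

def attribute_spans(cross_attn: np.ndarray,
                    span_bounds: list[tuple[int, int]],
                    sent_bounds: list[tuple[int, int]]) -> list[int]:
    """For each generated sentence, return the index of the source span
    (a dialogue turn, report segment, or image-caption span) that
    receives the highest attention mass.

    cross_attn  : [num_generated_tokens, num_source_tokens] decoder
                  cross-attention weights, e.g. averaged over layers/heads.
    span_bounds : (start, end) token offsets of candidate source spans.
    sent_bounds : (start, end) token offsets of generated sentences.
    """
    citations = []
    for s_start, s_end in sent_bounds:
        # Average attention of this sentence's tokens over the source tokens.
        sent_attn = cross_attn[s_start:s_end].mean(axis=0)
        # Sum the attention mass falling inside each candidate span.
        span_scores = [sent_attn[a:b].sum() for a, b in span_bounds]
        citations.append(int(np.argmax(span_scores)))
    return citations

# Toy example: 4 generated tokens, 10 source tokens, 2 candidate spans.
rng = np.random.default_rng(0)
attn = rng.random((4, 10))
print(attribute_spans(attn, span_bounds=[(0, 5), (5, 10)], sent_bounds=[(0, 4)]))
```

In the caption-as-span mode described above, the image would simply contribute one more text span (its generated caption) to `span_bounds`, so the same purely text-based scoring applies.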