Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines on the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves state-of-the-art results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest scores in expert human evaluation. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.
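The abstract does not say how the Shapley scores that guide patch selection are computed. As a rough illustration of the general idea only, the sketch below uses a standard Monte Carlo permutation estimator to score patches by their average marginal contribution and then keeps the top-scoring ones. All names here (`shapley_patch_scores`, `top_k_patches`, `toy_value`, `RELEVANCE`) are hypothetical, and the additive toy value function is an assumed stand-in for whatever model-based utility ViTAS actually uses; this is not the authors' implementation.

```python
import random
from typing import Callable, FrozenSet, List

# Hypothetical per-patch relevance, used only by the toy value function below.
RELEVANCE: List[float] = [0.0, 0.5, 0.1, 0.9, 0.0, 0.3]


def toy_value(coalition: FrozenSet[int]) -> float:
    """Assumed stand-in utility: sum of relevance over the included patches.

    In a real pipeline this would be a model-derived score for the subset.
    """
    return sum(RELEVANCE[p] for p in coalition)


def shapley_patch_scores(
    n_patches: int,
    value_fn: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate each patch's Shapley value by sampling random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_patches
    for _ in range(n_permutations):
        order = list(range(n_patches))
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for patch in order:
            # Marginal contribution of this patch given those added before it.
            coalition.add(patch)
            cur = value_fn(frozenset(coalition))
            totals[patch] += cur - prev
            prev = cur
    return [t / n_permutations for t in totals]


def top_k_patches(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring patches, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    scores = shapley_patch_scores(len(RELEVANCE), toy_value, n_permutations=50)
    print(top_k_patches(scores, 2))
```

Because the toy utility is additive, each patch's estimated score matches its relevance up to floating-point error, which makes the sketch easy to sanity-check. With a non-additive, model-based utility the estimate would depend on the number of sampled permutations.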