Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on the proposer-ranker framework of VisDiff and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark of 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy (50% when guided by ground-truth reports), significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
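To make the proposer-ranker loop with iterative hypothesis refinement concrete, the sketch below shows one plausible control flow in Python. It is a minimal toy illustration under stated assumptions, not RadDiff's implementation: every name (`propose_differences`, `rank_hypotheses`, `rad_diff`, `Hypothesis`) is hypothetical, and the vision-language-model calls are replaced with dummy logic.

```python
"""A minimal sketch of a proposer-ranker loop with iterative refinement.
All names here are hypothetical illustrations, not RadDiff's actual API;
the model calls are stubbed out with toy logic so the script runs as-is."""

from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str    # candidate difference, e.g. "set A shows larger effusions"
    score: float = 0.0  # how well the candidate separates the two study sets


def propose_differences(set_a, set_b, reports, prior):
    # Assumption: in the real system a domain-adapted vision-language model
    # drafts candidates from image pairs plus clinical reports, conditioned
    # on the previous round's best hypotheses. Here we fabricate candidates.
    seeds = [h.description for h in prior] or ["baseline difference"]
    return [Hypothesis(f"refined: {d}") for d in seeds]


def rank_hypotheses(candidates, set_a, set_b):
    # Assumption: the ranker scores each hypothesis by how reliably it can
    # be verified on held-out image pairs. Here we assign a dummy score.
    for h in candidates:
        h.score = (len(h.description) % 10) / 10.0
    return sorted(candidates, key=lambda h: h.score, reverse=True)


def rad_diff(set_a, set_b, reports, rounds=3, keep=5):
    best = []
    for _ in range(rounds):  # iterative hypothesis refinement across rounds
        candidates = propose_differences(set_a, set_b, reports, prior=best)
        best = rank_hypotheses(candidates, set_a, set_b)[:keep]
    return best


if __name__ == "__main__":
    for h in rad_diff(set_a=[], set_b=[], reports=[]):
        print(f"{h.score:.1f}  {h.description}")
```

In the system described above, the proposer and ranker would instead be backed by domain-adapted vision-language models, with each round's surviving hypotheses conditioning the next round's proposals.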