Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making through the analysis of radiology imaging data such as chest X-rays (CXR) via a visual question-answering (VQA) interface combining images and natural language. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis, however, is time-consuming, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation that leverages multi-image, multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR dataset and its associated Medical-Diff-VQA dataset. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models on diagnostic VQA tasks and, more importantly, shows promising prognostic potential. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful interpretation of longitudinal radiological imaging data.