Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions about the differences between a pair of images. The task is particularly important in reading chest X-rays, because in clinical practice radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in severity. However, previous work has focused on designing task-specific network architectures for diff-VQA, missing the opportunity to improve performance with a pretrained vision-language model (VLM). Here, we introduce PLURAL, a novel VLM pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed in stages: it is first pretrained on natural images and texts, and then trained on longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, together with question-answer sets and radiologists' reports that describe changes in lung abnormalities and diseases over time. Our experimental results show that PLURAL outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving model performance.