Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays

Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on the differences between a pair of images. The task is particularly important in reading chest X-rays, because in clinical practice radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in severity. However, previous works have focused on designing task-specific network architectures for diff-VQA, missing the opportunity to improve performance with a pretrained vision-language model (VLM). Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed in stages: it is first pretrained on natural images and texts, and then trained on longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, along with question-answer sets and radiologists' reports that describe the changes in lung abnormalities and diseases over time. Our experimental results show that PLURAL outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving the model's performance.

