Differential medical visual question answering (VQA) models compare multiple images to identify clinically meaningful changes and rely on vision encoders to capture fine-grained visual differences that reflect radiologists' comparative diagnostic workflows. However, vision encoders trained using standard contrastive or classification objectives often fail to capture the subtle variations needed to distinguish true disease progression from acquisition-related variability. To address this limitation, we introduce a location-aware pretraining framework that incorporates automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks promote the learning of fine-grained, spatially grounded visual representations. When integrated with a language model, our approach achieves state-of-the-art performance on medical difference VQA by accurately identifying and reasoning about clinically relevant changes in chest X-ray images.