Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data are hampered by incomplete electronic health records (EHR). This paper introduces \textbf{MedPromptX}, the first model to integrate multimodal large language models (MLLMs), few-shot prompting (FP), and visual grounding (VG) to combine imagery with EHR data for chest X-ray diagnosis. A pre-trained MLLM is used to fill in missing EHR information, providing a comprehensive view of patients' medical history. FP reduces the need for extensive MLLM training while effectively addressing hallucination. However, determining the optimal number of few-shot examples and selecting high-quality candidates is burdensome, and both choices strongly influence model performance. We therefore propose a new technique that dynamically refines the few-shot data, allowing real-time adjustment to new patient scenarios. In addition, VG focuses the model's attention on relevant regions of interest in the X-ray images, improving the identification of abnormalities. We release MedPromptX-VQA, a new in-context visual question answering dataset of interleaved image and EHR data derived from the MIMIC-IV and MIMIC-CXR databases. Results show that MedPromptX achieves state-of-the-art performance, with an 11% improvement in F1-score over the baselines. Code and data are available at https://github.com/BioMedIA-MBZUAI/MedPromptX