Retrieving visual and textual information from medical literature and hospital records can enhance diagnostic accuracy in clinical image interpretation. However, multimodal retrieval-augmented diagnosis remains highly challenging. We explore a lightweight mechanism for improving the diagnostic performance of retrieval-augmented LVLMs: we train an LVLM-aware multimodal retriever that learns to return images and texts which guide the LVLM toward correct predictions. In our low-resource setting, we perform only lightweight fine-tuning on small amounts of data and use only general-purpose backbone models, yet achieve results on clinical classification and VQA tasks that are competitive with extensively trained, medically pre-trained models. In a novel analysis, we highlight a previously unexplored class of errors that we term inconsistent retrieval predictions: cases where different top-retrieved images yield different predictions for the same target. We find that these cases are challenging for all models, even non-retrieval ones, and that our retrieval optimization mechanism significantly improves over standard RAG on them. However, our analysis also sheds light on gaps in the ability of LVLMs to utilize retrieved information for clinical predictions. Code and models are available at: https://github.com/Nirmaz/JOMED.
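The abstract describes training an LVLM-aware multimodal retriever whose scores reflect how useful each retrieved candidate is to the LVLM. Below is a minimal, hypothetical sketch of one way such an objective can look, assuming a distillation-style loss (in the spirit of REPLUG-like retriever training) that pulls the retriever's distribution over top-k candidates toward the LVLM's log-probability of the gold answer; all function and variable names are illustrative and do not reproduce the released JOMED implementation.

```python
# Hypothetical sketch: make the retriever prefer candidates that help the LVLM
# predict the gold answer. Assumes precomputed retriever scores and frozen-LVLM
# answer log-probabilities for each (image, text) candidate.
import torch
import torch.nn.functional as F

def lvlm_aware_retrieval_loss(retriever_scores: torch.Tensor,
                              lvlm_answer_logprobs: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """KL-style distillation over the top-k retrieved candidates.

    retriever_scores:     [batch, k] similarity scores from the retriever.
    lvlm_answer_logprobs: [batch, k] LVLM log p(gold answer | query, candidate).
    """
    retriever_logp = F.log_softmax(retriever_scores / temperature, dim=-1)
    # Target distribution: softmax over how strongly each candidate supports
    # the correct prediction; detached so only the retriever is updated.
    target = F.softmax(lvlm_answer_logprobs / temperature, dim=-1).detach()
    return F.kl_div(retriever_logp, target, reduction="batchmean")

# Toy usage: 2 queries, 4 retrieved candidates each.
scores = torch.randn(2, 4, requires_grad=True)       # retriever outputs
answer_lp = torch.randn(2, 4)                         # from frozen LVLM passes
loss = lvlm_aware_retrieval_loss(scores, answer_lp)
loss.backward()
```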