The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factuality issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which leverages external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, a limited number of retrieved contexts may not cover all necessary information, while excessive retrieval can introduce irrelevant or inaccurate references that interfere with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts during generation. We demonstrate the effectiveness of RULE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy. We publicly release our benchmark and code at https://github.com/richard-peng-xia/RULE.