The adoption of large language models (LLMs) in healthcare has attracted significant research interest. However, their performance in healthcare remains under-investigated and potentially limited, because i) they lack rich domain-specific knowledge and medical reasoning skills; and ii) most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To address these limitations, we propose \textbf{MultiMedRes}, a multimodal medical collaborative reasoning framework in which a learner agent proactively acquires essential information from domain-specific expert models to solve medical multimodal reasoning problems. Our method comprises three steps: i) \textbf{Inquire}: the learner agent first decomposes a given complex medical reasoning problem into multiple domain-specific sub-problems; ii) \textbf{Interact}: the agent then interacts with domain-specific expert models, repeating an ``ask-answer'' process to progressively obtain different domain-specific knowledge; iii) \textbf{Integrate}: the agent finally integrates all the acquired domain-specific knowledge to address the medical reasoning problem accurately. We validate the effectiveness of our method on the task of difference visual question answering for X-ray images. The experiments demonstrate that our zero-shot prediction achieves state-of-the-art performance and even outperforms fully supervised methods. Moreover, our approach can be incorporated into various LLMs and multimodal LLMs to significantly boost their performance.
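The Inquire, Interact, and Integrate steps described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the MultiMedRes implementation: the class `LearnerAgent`, the method names, the fixed sub-question template, and the stub `toy_abnormality_expert` are all hypothetical. In the framework itself, an LLM would presumably drive the decomposition and the follow-up questions, and the experts would be trained medical models rather than a lookup table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical expert interface: (image_id, question) -> free-text answer.
Expert = Callable[[str, str], str]
# A sub-problem is (expert_name, image_id, question).
SubProblem = Tuple[str, str, str]


@dataclass
class LearnerAgent:
    """Toy learner agent; all names here are illustrative assumptions."""
    experts: Dict[str, Expert]
    transcript: List[Tuple[str, str, str]] = field(default_factory=list)

    def inquire(self, image_ids: List[str]) -> List[SubProblem]:
        # Inquire: split the problem into per-image sub-questions.
        # Assumption: a fixed template stands in for LLM-driven decomposition.
        question = "Which abnormalities are visible?"
        return [("abnormality", image_id, question) for image_id in image_ids]

    def interact(self, sub_problems: List[SubProblem]) -> None:
        # Interact: repeat the ask-answer step and record each exchange.
        for expert_name, image_id, question in sub_problems:
            answer = self.experts[expert_name](image_id, question)
            self.transcript.append((image_id, question, answer))

    def integrate(self, reference_id: str, main_id: str) -> Dict[str, List[str]]:
        # Integrate: merge the collected answers into a difference summary.
        findings: Dict[str, Set[str]] = {}
        for image_id, _question, answer in self.transcript:
            parsed = {item.strip() for item in answer.split(",") if item.strip()}
            findings.setdefault(image_id, set()).update(parsed)
        reference = findings.get(reference_id, set())
        main = findings.get(main_id, set())
        return {
            "new": sorted(main - reference),
            "resolved": sorted(reference - main),
        }

    def solve(self, reference_id: str, main_id: str) -> Dict[str, List[str]]:
        self.transcript.clear()
        self.interact(self.inquire([reference_id, main_id]))
        return self.integrate(reference_id, main_id)


def toy_abnormality_expert(image_id: str, question: str) -> str:
    # Stub expert returning canned, made-up findings for two image ids.
    canned = {
        "ref": "pleural effusion",
        "main": "pleural effusion, cardiomegaly",
    }
    return canned[image_id]


agent = LearnerAgent(experts={"abnormality": toy_abnormality_expert})
result = agent.solve(reference_id="ref", main_id="main")
print(result)
```

Because the agent only sees the experts through a text-in, text-out interface, swapping in a different expert model or a different backbone LLM for the decomposition would leave the loop unchanged, which is consistent with the abstract's claim that the approach can be incorporated into various LLMs.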