Medical dialogue systems aim to give patients accurate answers, which requires domain-specific knowledge. Recent Large Language Models (LLMs) have shown strong capabilities in medical question answering, indicating a rich grasp of common-sense knowledge. However, LLMs are insufficient for direct diagnosis because they lack diagnostic strategies. The conventional remedy is to fine-tune the LLM, which is expensive. A more appealing alternative is a plugin that enables LLMs to carry out medical conversation tasks. Drawing inspiration from in-context learning, we propose PlugMed, a Plug-and-Play Medical Dialogue System that guides LLMs toward appropriate dialogue actions through two modules: the prompt generation (PG) module and the response ranking (RR) module. The PG module captures dialogue information from both global and local perspectives: it selects suitable prompts by their similarity to the entire dialogue history (global) and to recent utterances grouped by patient symptoms (local). The RR module uses fine-tuned small language models (SLMs) as response filters to select appropriate responses from those generated by the LLM. We also devise a novel evaluation method, based on matching intents and medical entities, to assess the efficacy of dialogue strategies in medical conversations more effectively. Experiments on three unlabeled medical dialogue datasets, with both automatic and manual evaluation, show that our model surpasses strong fine-tuning baselines.
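The global and local prompt selection described for the PG module can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`bow`, `cosine`, `select_prompts`), the toy example pool, and the use of bag-of-words cosine similarity, which stands in for whatever similarity measure PlugMed actually uses. The symptom-based grouping of recent utterances is also not modeled; the sketch simply takes the recent utterances as given.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (illustrative stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_prompts(history, recent, pool, k=1):
    """Pick k examples from the pool by global and by local similarity.

    history: all utterances so far (global view).
    recent:  the most recent utterances (local view); assumed to be
             already grouped by symptom, which this sketch does not do.
    pool:    candidate example dialogues to use as in-context prompts.
    """
    global_vec = bow(" ".join(history))
    local_vec = bow(" ".join(recent))
    by_global = sorted(pool, key=lambda ex: cosine(global_vec, bow(ex)), reverse=True)
    by_local = sorted(pool, key=lambda ex: cosine(local_vec, bow(ex)), reverse=True)
    return by_global[:k], by_local[:k]


# Hypothetical toy data, not taken from any dataset in the paper.
pool = [
    "Patient: I have a cough and a fever. Doctor: How long have you had the fever?",
    "Patient: My stomach hurts after meals. Doctor: Do you also feel nauseous?",
    "Patient: I get headaches every morning. Doctor: Have you checked your blood pressure?",
]
history = [
    "I have had a fever since Monday.",
    "Any cough?",
    "Yes, a dry cough.",
    "Now my stomach hurts after meals too.",
]
recent = ["Now my stomach hurts after meals too."]

global_picks, local_picks = select_prompts(history, recent, pool)
```

In this toy case the global view favors the fever-and-cough example because most of the history concerns those symptoms, while the local view favors the stomach-pain example because it matches the latest utterance. Combining both kinds of picks in the prompt is one plausible reading of "global and local perspectives", not a confirmed detail of the system.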