Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks. In safety-critical applications such as healthcare, the utility of these models is governed by their ability to generate outputs that are factually accurate and complete. In this work, we present dialog-enabled resolving agents (DERA). DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4. It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output. We frame our dialog as a discussion between two agent types: a Researcher, who processes information and identifies crucial problem components, and a Decider, who has the autonomy to integrate the Researcher's information and make judgments on the final output. We test DERA on three clinically focused tasks. For medical conversation summarization and care plan generation, DERA shows significant improvement over base GPT-4 performance in both human expert preference evaluations and quantitative metrics. In a new finding, we also show that GPT-4's performance (70%) on an open-ended version of the MedQA question-answering (QA) dataset (Jin et al. 2021, USMLE) is well above the passing level (60%), with DERA showing similar performance. We release the open-ended MedQA dataset at https://github.com/curai/curai-research/tree/main/DERA.
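The Researcher and Decider exchange described above can be illustrated with a minimal sketch. This is not the released implementation: the function name `dera_loop`, the `"DONE"` stop signal, the turn limit, and the toy rule-based agents are all assumptions made for illustration, and in practice each agent would be a prompted LLM call rather than a hand-written function.

```python
from typing import Callable, List, Tuple

# Hypothetical agent signature (an assumption, not from the paper):
# each agent sees the task input, the current draft, and the dialog so far.
Agent = Callable[[str, str, List[str]], str]


def dera_loop(
    task_input: str,
    initial_draft: str,
    researcher: Agent,
    decider: Agent,
    max_turns: int = 3,  # assumed cap on dialog length
) -> Tuple[str, List[str]]:
    """Alternate Researcher feedback and Decider revisions on a draft."""
    draft = initial_draft
    dialog: List[str] = []
    for _ in range(max_turns):
        # The Researcher points out one problem component in the draft.
        feedback = researcher(task_input, draft, dialog)
        dialog.append(f"Researcher: {feedback}")
        if feedback == "DONE":  # assumed stop signal
            break
        # The Decider chooses how to integrate the feedback into the draft.
        draft = decider(task_input, draft, dialog)
        dialog.append(f"Decider: {draft}")
    return draft, dialog


# Toy stand-ins for the two agents, used only to make the sketch runnable.
def toy_researcher(task_input: str, draft: str, dialog: List[str]) -> str:
    if "penicillin allergy" in task_input and "penicillin allergy" not in draft:
        return "The draft omits the penicillin allergy."
    return "DONE"


def toy_decider(task_input: str, draft: str, dialog: List[str]) -> str:
    return draft + " Patient reports a penicillin allergy."


final, dialog = dera_loop(
    "Patient has had a cough for 3 days and a penicillin allergy.",
    "Cough for 3 days.",
    toy_researcher,
    toy_decider,
)
print(final)
for line in dialog:
    print(line)
```

Keeping the full dialog alongside the final draft reflects the interpretability point made above: a reader can trace which piece of feedback led to each revision.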