In high-stakes domains such as clinical reasoning, AI assistants powered by large language models (LLMs) are not yet reliable and safe. We identify a key obstacle to reliability: existing LLMs are trained to answer any question, even when the prompt provides incomplete context or the model lacks sufficient parametric knowledge. We propose to change this paradigm and develop more careful LLMs that ask follow-up questions to gather necessary and sufficient information before responding reliably. We introduce MEDIQ, a framework for simulating realistic clinical interactions that comprises a Patient System and an adaptive Expert System. The Patient may provide incomplete information at the outset; the Expert refrains from making diagnostic decisions when unconfident and instead elicits missing details from the Patient through follow-up questions. To evaluate MEDIQ, we convert MEDQA and CRAFT-MD -- medical benchmarks for diagnostic question answering -- into an interactive setup. We develop a reliable Patient system and prototype several Expert systems, first showing that directly prompting state-of-the-art LLMs to ask questions degrades the quality of clinical reasoning, indicating that adapting LLMs to interactive information-seeking settings is nontrivial. We then augment the Expert with a novel abstention module that better estimates model confidence and decides whether to ask more questions, improving diagnostic accuracy by 20.3%; however, performance still lags behind an upper bound (unrealistic in practice) in which full information is given upfront. Further analyses reveal that interactive performance can be improved by filtering irrelevant contexts and reformatting conversations. Overall, this paper introduces a novel problem concerning LLM reliability, presents the novel MEDIQ framework, and highlights important future directions for extending the information-seeking abilities of LLM assistants in critical domains.