Users typically engage with LLMs interactively, yet most existing benchmarks evaluate them in a static, single-turn format, raising reliability concerns for interactive scenarios. We identify a key obstacle to reliability: LLMs are trained to answer any question, even with incomplete context or insufficient knowledge. In this paper, we propose to shift this static paradigm to an interactive one, develop systems that proactively ask questions to gather more information and respond reliably, and introduce a benchmark, MediQ, to evaluate question-asking ability in LLMs. MediQ simulates clinical interactions between a Patient System and an adaptive Expert System; given potentially incomplete initial information, the Expert refrains from making diagnostic decisions when unconfident and instead elicits missing details via follow-up questions. We provide a pipeline to convert single-turn medical benchmarks into this interactive format. Our results show that directly prompting state-of-the-art LLMs to ask questions degrades performance, indicating that adapting LLMs to proactive information-seeking settings is nontrivial. We experiment with abstention strategies to better estimate model confidence and decide when to ask questions, improving diagnostic accuracy by 22.3%; however, performance still lags behind an (unrealistic in practice) upper bound with complete information provided upfront. Further analyses show that filtering irrelevant contexts and reformatting conversations improve interactive performance. Overall, we introduce a novel problem for LLM reliability, the interactive MediQ benchmark, and a novel question-asking system, and we highlight directions for extending LLMs' information-seeking abilities in critical domains.
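The confidence-gated abstain-or-ask loop described above can be sketched as follows. This is a minimal illustration only, assuming pluggable `expert_confidence`, `ask_question`, `patient_answer`, and `expert_answer` callables and a fixed confidence `threshold`; none of these names are the paper's actual API.

```python
def run_interaction(initial_info, patient_answer, expert_confidence,
                    expert_answer, ask_question, threshold=0.8, max_turns=5):
    """Ask follow-up questions until the expert is confident, then answer.

    Hypothetical sketch of a MediQ-style loop: the Expert abstains from a
    diagnostic decision while unconfident and elicits missing details from
    the Patient System instead.
    """
    context = list(initial_info)          # possibly incomplete initial facts
    for _ in range(max_turns):
        if expert_confidence(context) >= threshold:
            break                          # confident enough to answer
        question = ask_question(context)   # elicit one missing detail
        context.append((question, patient_answer(question)))
    return expert_answer(context)          # final diagnostic decision


# Toy usage with stub components: confidence grows as facts accumulate.
result = run_interaction(
    initial_info=["headache"],
    patient_answer=lambda q: "yes",
    expert_confidence=lambda ctx: len(ctx) / 3,
    expert_answer=lambda ctx: f"answer after {len(ctx)} facts",
    ask_question=lambda ctx: f"follow-up question {len(ctx)}",
)
print(result)  # → answer after 3 facts
```

The key design point is that the stopping rule depends on an explicit confidence estimate rather than always answering immediately, mirroring the abstention strategies the paper evaluates.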