Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries, they can act unpredictably and produce incorrect outputs. This underscores the need to develop intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning, and planning over multiple conversational turns, and it is challenging to measure directly. In this paper, we offer a surrogate problem that assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This \textit{entity-deducing game} can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to new data or domains, using only demonstrations from the stronger model. Finally, we propose using Reinforcement Learning (RL) to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.
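To make the structure of the entity-deducing game concrete, the following is a minimal toy sketch of one episode, not the evaluation setup itself. In the framework described above, both the guesser and the judge are language models exchanging natural-language queries; here both are replaced by simple rule-based stand-ins. The entity table, the `attr:`/`guess:` query format, the "Yes"/"No"/"Bingo" answers, the turn cap, and the names `make_judge` and `play` are all illustrative assumptions.

```python
# Hypothetical toy knowledge base: entity -> attributes the judge can confirm.
ENTITIES = {
    "cat": {"animal", "pet"},
    "wolf": {"animal"},
    "car": {"vehicle"},
    "bicycle": {"vehicle", "human_powered"},
}


def make_judge(secret):
    """Return a judge that knows the secret entity and answers queries about it."""
    def judge(query):
        kind, _, value = query.partition(":")
        if kind == "guess":
            return "Bingo" if value == secret else "No"
        return "Yes" if value in ENTITIES[secret] else "No"
    return judge


def play(secret, max_turns=20):
    """Run one episode; return (success, turns_used, transcript)."""
    judge = make_judge(secret)
    candidates = set(ENTITIES)  # guesser's state: entities still consistent
    transcript = []
    turn = 0
    for turn in range(1, max_turns + 1):
        if len(candidates) == 1:
            # Only one entity remains consistent with the answers: guess it.
            query = "guess:" + next(iter(candidates))
        else:
            # Planning step: ask about the attribute that splits the
            # remaining candidates most evenly.
            attrs = sorted({a for e in candidates for a in ENTITIES[e]})
            attr = min(
                attrs,
                key=lambda a: abs(
                    2 * sum(a in ENTITIES[e] for e in candidates) - len(candidates)
                ),
            )
            query = "attr:" + attr
        answer = judge(query)
        transcript.append((query, answer))
        if answer == "Bingo":
            return True, turn, transcript
        if query.startswith("guess:"):
            break  # wrong final guess; cannot happen with a consistent judge
        # State tracking: keep only entities consistent with the answer.
        has_attr = answer == "Yes"
        candidates = {e for e in candidates if (attr in ENTITIES[e]) == has_attr}
    return False, turn, transcript


if __name__ == "__main__":
    success, turns, log = play("cat")
    print(success, turns)
    for query, answer in log:
        print(query, "->", answer)
```

The sketch separates the two skills the game is meant to probe: state tracking (narrowing `candidates` after each answer) and planning (choosing the next query). An LLM guesser has to do both implicitly, through dialogue alone, which is what makes the game a useful surrogate.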