Large language models (LLMs) are effective at answering clearly posed questions. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for intelligent agents that can ask clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning, and planning over multiple conversational turns, and it is challenging to measure directly. In this paper, we offer a surrogate problem that assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This \textit{entity-deducing game} can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model can imitate a stronger model and generalize to new data or domains, using only demonstrations from the stronger model. Finally, we propose using Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.
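To make the game protocol concrete, the following is a minimal sketch of one episode, not the paper's implementation. It assumes a toy attribute table (`ENTITIES`), a rule-based judge, a greedy guesser standing in for the LLM, a three-word answer vocabulary ("Yes", "No", "Correct"), and a 20-turn cap; all of these names and values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy knowledge base: entity -> attributes that hold for it.
ENTITIES: Dict[str, Set[str]] = {
    "cat": {"alive", "animal", "pet"},
    "tiger": {"alive", "animal"},
    "oak": {"alive"},
    "car": {"has_wheels"},
    "rock": set(),
}


def make_judge(secret: str) -> Callable[[str], str]:
    """Build a judge that knows the secret entity and answers queries about it."""
    def judge(query: str) -> str:
        if query.startswith("guess:"):
            return "Correct" if query[len("guess:"):] == secret else "No"
        return "Yes" if query in ENTITIES[secret] else "No"
    return judge


def play(judge: Callable[[str], str], max_turns: int = 20) -> int:
    """Greedy guesser (a stand-in for the LLM player).

    Returns the number of turns used, or -1 if the turn budget runs out.
    """
    candidates: List[str] = list(ENTITIES)
    attributes = sorted(set().union(*ENTITIES.values()))

    for turn in range(1, max_turns + 1):
        def yes_count(attr: str) -> int:
            return sum(attr in ENTITIES[c] for c in candidates)

        # Only attributes that actually split the remaining candidates are useful.
        useful = [a for a in attributes if 0 < yes_count(a) < len(candidates)]
        if len(candidates) == 1 or not useful:
            query = "guess:" + candidates[0]
        else:
            # Ask about the attribute that splits the candidates most evenly.
            query = min(useful, key=lambda a: abs(2 * yes_count(a) - len(candidates)))

        answer = judge(query)
        if answer == "Correct":
            return turn

        # State tracking: drop every candidate inconsistent with the answer.
        if query.startswith("guess:"):
            candidates.remove(query[len("guess:"):])
        else:
            keep_yes = answer == "Yes"
            candidates = [c for c in candidates if (query in ENTITIES[c]) == keep_yes]
        if not candidates:
            return -1
    return -1


if __name__ == "__main__":
    for secret in ENTITIES:
        print(secret, play(make_judge(secret)))
```

The guesser here plans by halving the candidate set, which is only possible because the toy world is fully enumerated; an LLM player has no such table and must track the dialogue state and choose informative questions on its own, which is the capability the game is meant to probe.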