Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need to develop intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning and planning over multiple conversational turns. However, measuring it directly can be challenging. In this paper, we offer a surrogate problem which assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to new data or domains, using only the demonstrations from a stronger model. Finally, we propose to use Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.
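
To make the game protocol concrete, the following is a minimal sketch of one episode between a judge that knows the entity and a guesser that does not. It is an illustration under stated assumptions, not the paper's implementation: the toy knowledge base `ENTITIES`, the functions `judge`, `pick_question` and `play`, the yes/no answer format, and the greedy "split the candidates in half" questioning strategy are all hypothetical stand-ins for the LLM judge and LLM player described above.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical toy knowledge base (an assumption for illustration only):
# each entity maps to the set of attributes that hold for it.
ENTITIES = {
    "cat": {"animal", "pet"},
    "eagle": {"animal", "flies"},
    "oak": {"plant"},
    "kite": {"flies", "man-made"},
    "chair": {"man-made"},
}


def judge(secret: str, attribute: str) -> bool:
    """The judge knows the secret entity and answers a yes/no query about it."""
    return attribute in ENTITIES[secret]


def pick_question(candidates: List[str], asked: Set[str]) -> Optional[str]:
    """Pick the unasked attribute that splits the candidates most evenly.

    Returns None when no remaining attribute separates the candidates.
    """
    attributes = sorted({a for c in candidates for a in ENTITIES[c]} - asked)
    best, best_gap = None, None
    for attr in attributes:
        yes = sum(attr in ENTITIES[c] for c in candidates)
        if yes == 0 or yes == len(candidates):
            continue  # this question would not eliminate any candidate
        gap = abs(2 * yes - len(candidates))  # 0 means a perfect half split
        if best_gap is None or gap < best_gap:
            best, best_gap = attr, gap
    return best


def play(secret: str, max_turns: int = 20) -> Tuple[bool, int]:
    """Run one episode; return (success, number of turns used)."""
    candidates = sorted(ENTITIES)  # state tracking: entities still possible
    asked: Set[str] = set()
    for turn in range(1, max_turns + 1):
        question = pick_question(candidates, asked) if len(candidates) > 1 else None
        if question is None:
            # No informative question is left, so spend this turn on a guess.
            guess = candidates[0]
            if guess == secret:
                return True, turn
            candidates = candidates[1:]
            if not candidates:
                return False, turn
            continue
        asked.add(question)
        answer = judge(secret, question)
        # Keep only the candidates consistent with the judge's answer.
        candidates = [c for c in candidates if (question in ENTITIES[c]) == answer]
    return False, max_turns


if __name__ == "__main__":
    for entity in ENTITIES:
        print(entity, play(entity))
```

In this sketch the guesser's explicit candidate list plays the role of state tracking and the choice of the next question plays the role of planning; an LLM player has to do both implicitly through the dialogue alone.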
