In this paper, we investigate the performance of LLMs on complex planning tasks that require them to understand a virtual spatial environment simulated via natural language and to act on it in text. We propose a benchmark named Natural Language Planning (NLP), composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We find that current popular LLMs such as ChatGPT still lack the ability to handle complex planning. This raises a question: do LLMs have a good understanding of environments described in natural language, or are alternatives such as symbolic representations more compact and therefore easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting), which represents complex environments with condensed symbolic spatial representations during the chained intermediate reasoning steps. CoS is easy to use and requires no additional training of the LLM. Extensive experiments indicate that CoS clearly outperforms Chain-of-Thought (CoT) Prompting on all three planning tasks, on both ChatGPT and InstructGPT, while using fewer input tokens than CoT. The accuracy gain reaches up to 60.8 percentage points (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially shortens the prompt, reducing the tokens used for the intermediate steps of the Brick World demonstrations by up to 65.8% (from 407 to 139).
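The two headline figures are different kinds of quantities: the accuracy gain is an absolute difference between two percentages, whereas the token saving is a reduction relative to the CoT baseline. The following minimal sketch reproduces both from the numbers reported above; the function names are illustrative only and are not part of the CoS method or its released code.

```python
def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(after_pct - before_pct, 1)


def relative_reduction(before: int, after: int) -> float:
    """Reduction relative to the baseline count, in percent."""
    return round(100 * (before - after) / before, 1)


# Brick World, ChatGPT: accuracy with CoT vs. with CoS (values from the abstract).
accuracy_gain = absolute_gain(31.8, 92.6)

# Brick World demonstrations: intermediate-step tokens with CoT vs. with CoS.
token_reduction = relative_reduction(407, 139)

print(f"accuracy gain: {accuracy_gain} percentage points")
print(f"token reduction: {token_reduction}%")
```

Under these definitions the script yields 60.8 and 65.8, matching the reported values.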