Chain-of-Symbol Prompting Elicits Planning in Large Language Models

In this paper, we take the initiative to investigate the performance of LLMs on complex planning tasks that require them to understand a virtual spatial environment simulated via natural language and to act correspondingly in text. We propose a benchmark named Natural Language Planning and Action (Natala), composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We find that current popular LLMs such as ChatGPT still lack abilities in complex planning. This raises a question: do LLMs have a good understanding of environments described in natural language, or are alternatives such as symbolic representations neater and hence easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting) that represents complex environments with condensed symbolic spatial representations during the chained intermediate thinking steps. CoS is easy to use and does not need additional training of LLMs. Extensive experiments indicate that CoS clearly surpasses Chain-of-Thought (CoT) Prompting in all three planning tasks on ChatGPT and InstructGPT, while using fewer input tokens than CoT. The performance gain is strong: up to 60.8 percentage points of accuracy (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially reduces the number of tokens in the prompt, by up to 65.8% (from 407 to 139) for the intermediate steps of the demonstrations on Brick World. Code and data are available at: https://github.com/hanxuhu/chain-of-symbol-planning
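To make the idea of a condensed symbolic representation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical notation in which a stack of bricks is written top to bottom with "/" as the separator; the function names and the example stack are illustrative only. It also re-derives the two headline figures quoted above from the reported raw numbers.

```python
from typing import List


def to_symbol_chain(stack_top_to_bottom: List[str], sep: str = "/") -> str:
    """Condense a brick stack into one symbolic string.

    The "/" separator is an assumed notation for this sketch.
    """
    return sep.join(stack_top_to_bottom)


def removal_order(stack_top_to_bottom: List[str], target: str) -> List[str]:
    """Bricks to grab, in order, so that `target` can be taken.

    Only the top brick can be removed, so every brick above the
    target must be removed first, followed by the target itself.
    """
    idx = stack_top_to_bottom.index(target)
    return stack_top_to_bottom[: idx + 1]


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Hypothetical environment: B is on top of A, and A is on top of C.
stack = ["B", "A", "C"]
chain = to_symbol_chain(stack)       # "B/A/C"
plan = removal_order(stack, "C")     # ["B", "A", "C"]

# Figures reported in the abstract for Brick World on ChatGPT.
accuracy_gain_points = round(92.6 - 31.8, 1)               # 60.8
token_reduction_pct = round(relative_reduction(407, 139), 1)  # 65.8
```

The verbose natural-language description ("brick B is on top of brick A, brick A is on top of brick C") collapses here to the five-character string `B/A/C`, which is the kind of saving the reported token reduction reflects. Note that the accuracy gain is an absolute difference in percentage points, whereas the token reduction is a relative percentage.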
