Chain-of-Symbol Prompting Elicits Planning in Large Language Models

In this paper, we take the initiative to investigate the performance of LLMs on complex planning tasks that require them to understand a virtual spatial environment simulated via natural language and to act correspondingly in text. We propose a benchmark named Natural Language Planning and Action (Natala), composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We find that current popular LLMs such as ChatGPT still lack abilities in complex planning. This raises a question: do LLMs have a good understanding of environments described in natural language, or are alternatives such as symbolic representations more concise and hence easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting) that represents complex environments with condensed symbolic spatial representations during the chained intermediate thinking steps. CoS is easy to use and requires no additional training of the LLM. Extensive experiments on ChatGPT and InstructGPT indicate that CoS clearly surpasses Chain-of-Thought (CoT) Prompting on all three planning tasks while using even fewer input tokens than CoT. The performance gain is strong: up to 60.8 percentage points of accuracy (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially reduces the number of tokens in the prompt, by up to 65.8% (from 407 to 139) for the intermediate steps of the demonstrations on Brick World. Code and data are available at: https://github.com/hanxuhu/chain-of-symbol-planning
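To make the idea of a condensed symbolic spatial representation concrete, the following is a minimal sketch for a Brick World-style stacking scenario. The `/` notation (read "is on top of"), the function names `stacks_to_symbols` and `removal_order`, and the toy scene are assumptions made for illustration only; they are not taken from the released code and may differ from the symbols used in the actual prompts. The sketch also assumes each brick has at most one brick directly on top of it, and it ignores bricks that appear in no relation.

```python
from typing import Dict, List, Tuple


def stacks_to_symbols(on_top_of: List[Tuple[str, str]]) -> List[str]:
    """Condense (upper, lower) pairs into chains such as 'C/B/A' (top to bottom).

    Hypothetical notation: 'X/Y' means brick X sits directly on brick Y.
    """
    below: Dict[str, str] = dict(on_top_of)            # upper brick -> lower brick
    lowers = set(below.values())
    tops = [u for u in below if u not in lowers]        # bricks with nothing above them
    chains = []
    for top in sorted(tops):
        chain = [top]
        while chain[-1] in below:                       # walk down the stack
            chain.append(below[chain[-1]])
        chains.append("/".join(chain))
    return chains


def removal_order(chain: str, target: str) -> List[str]:
    """Bricks to pick up, top first, in order to reach `target` in one chain."""
    bricks = chain.split("/")
    return bricks[: bricks.index(target) + 1]


# Toy scene (invented): "C is on top of B. B is on top of A. E is on top of D."
relations = [("C", "B"), ("B", "A"), ("E", "D")]
chains = stacks_to_symbols(relations)
plan = removal_order(chains[0], "A")
print(chains)  # symbolic form that would replace the verbose description
print(plan)    # order in which bricks must be removed to grab A
```

In this sketch, three natural-language sentences collapse into two short symbol chains, which is the kind of compression that plausibly accounts for the token savings reported above; the planning step then becomes a simple read-off from the chain.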
