Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, we propose a new benchmark, termed $\textbf{P}$ath $\textbf{P}$lanning from $\textbf{N}$atural $\textbf{L}$anguage ($\textbf{PPNL}$). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating "path planning" tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs, including GPT-4 via different few-shot prompting methodologies and BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning when it is prompted to interleave reasoning and acting, although it still fails at long-term temporal reasoning. In contrast, while fine-tuned LLMs achieve impressive results on in-distribution reasoning tasks, they struggle to generalize to larger environments or environments with more obstacles.
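To make the task formulation concrete, the following is a minimal sketch of how a path-planning instance of this kind might be represented and how a candidate plan could be checked. The square-grid encoding, the action names (`up`, `down`, `left`, `right`), and the function `follows_plan` are illustrative assumptions, not the benchmark's actual data format or evaluation code.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, col)

# Hypothetical action vocabulary; the benchmark's actual action set may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def follows_plan(size: int, obstacles: Set[Cell], start: Cell, goal: Cell,
                 actions: Iterable[str]) -> bool:
    """Return True if the actions lead from start to goal on a size x size
    grid without leaving the grid or entering an obstacle cell."""
    row, col = start
    for action in actions:
        if action not in MOVES:
            return False  # unknown action name
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        if not (0 <= row < size and 0 <= col < size):
            return False  # stepped off the grid
        if (row, col) in obstacles:
            return False  # collided with an obstacle
    return (row, col) == goal


# Example: a 3x3 grid with a wall of two obstacles between start and goal.
obstacles = {(0, 1), (1, 1)}
detour = ["down", "down", "right", "right", "up", "up"]
print(follows_plan(3, obstacles, (0, 0), (0, 2), detour))              # True
print(follows_plan(3, obstacles, (0, 0), (0, 2), ["right", "right"]))  # False
```

A checker of this shape only validates a plan that a model has produced; it does not plan on its own, so it could in principle be applied to output from either a prompted or a fine-tuned model.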