Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the mind's eye, which enables them to imagine the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning in LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs.
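As a rough illustration of what "visualizing reasoning traces" could look like in a 2D grid world, the following minimal sketch renders a text grid after every move, so that each reasoning step is paired with a picture of the resulting state. The names `MOVES`, `render`, `trace`, and `build_prompt`, the `A`/`.` grid symbols, and the wording of the appended instruction are all assumptions made for illustration; they are not taken from the abstract and should not be read as the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical move vocabulary for a 2D grid world: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(size: int, pos: Tuple[int, int]) -> str:
    """Draw a size x size grid as text, marking the agent with 'A'."""
    return "\n".join(
        " ".join("A" if (r, c) == pos else "." for c in range(size))
        for r in range(size)
    )


def trace(size: int, start: Tuple[int, int], moves: List[str]) -> List[str]:
    """Return the rendered grid for the start state and after each move."""
    pos = start
    frames = [render(size, pos)]
    for move in moves:
        dr, dc = MOVES[move]
        # Clamp to the grid so the agent cannot walk off the edge.
        row = min(max(pos[0] + dr, 0), size - 1)
        col = min(max(pos[1] + dc, 0), size - 1)
        pos = (row, col)
        frames.append(render(size, pos))
    return frames


def build_prompt(task: str) -> str:
    """Append a VoT-style instruction; the exact wording is an assumption."""
    return f"{task}\nVisualize the state after each reasoning step."


if __name__ == "__main__":
    print(build_prompt("Start at the top-left cell. Move right, then down."))
    for step, frame in enumerate(trace(3, (0, 0), ["right", "down"])):
        print(f"\nState after step {step}:\n{frame}")
```

Here the frames are produced by ordinary code. In VoT as described above, the model itself would be expected to write such intermediate visualizations as part of its own output, and the sketch only shows the kind of interleaved step-and-state trace that is meant.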