Recent advances in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometric information. Motivated by this, we propose a new paradigm, Visual Planning, which performs planning through purely visual representations, serving as a supplementary channel to language-based reasoning for such "vision-first" tasks. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), which uses Group Relative Policy Optimization (GRPO) to post-train large vision models, yielding substantial improvements on a set of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that reason purely in the text space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
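For readers unfamiliar with GRPO (Group Relative Policy Optimization), the following is a minimal sketch of its standard group-relative advantage estimate and clipped objective, as commonly used for LLM post-training; the paper's specific reward design for visual plans is not reproduced here, and the symbols $q$ (input), $o_i$ (a sampled output), and $\epsilon$ (clip range) are illustrative. Given rewards $r_1, \dots, r_G$ for a group of $G$ trajectories sampled from the old policy,

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i\Big)\right],
\quad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}.
\]

In VPRL, each sampled trajectory $o_i$ would be a generated image sequence rather than a text response, so the advantage compares image-based plans within the same sampling group; the full GRPO objective also typically includes a KL penalty against a reference policy, omitted here for brevity.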