This study explores the potential of off-the-shelf Vision-Language Models (VLMs) for high-level robot planning in the context of autonomous navigation. While most existing learning-based approaches to path planning require extensive task-specific training or fine-tuning, we demonstrate how such training can be avoided in most practical cases. To this end, we introduce Select2Plan (S2P), a novel training-free framework for high-level robot planning that completely eliminates the need for fine-tuning or specialised training. By leveraging structured Visual Question-Answering (VQA) and In-Context Learning (ICL), our approach drastically reduces the need for data collection, requiring only a fraction of the task-specific data typically used by trained models, or even relying solely on online data. Our method enables the effective use of a generally trained VLM in a flexible and cost-efficient way, and requires no sensing beyond a simple monocular camera. We demonstrate its adaptability across various scene types, context sources, and sensing setups. We evaluate our approach in two distinct scenarios: traditional First-Person View (FPV) navigation and infrastructure-driven Third-Person View (TPV) navigation, demonstrating the flexibility and simplicity of our method. Our technique improves the navigational capabilities of a baseline VLM by approximately 50% in the TPV scenario, and is comparable to trained models in the FPV scenario with as few as 20 demonstrations.