Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, which limits their effectiveness in fine-grained robotic operation. Further challenges, including low recognition accuracy, inefficiency, poor transferability, and limited reliability, hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and a small language model (SLM) that supervises VLM outputs. The 2D prompt synthesis module enables VLMs, trained only on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises the VLM's outputs, mitigating hallucinations and ensuring the generation of reliable, executable robot control code. The framework requires no retraining in new environments, improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0\% Task Success Rate (TSR), outperforming competing methods. Ablation studies confirmed the critical role of both the 2D prompt synthesis module and the output supervision module: removing either caused a 67\% drop in TSR. These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.
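The 2D-image-to-point-cloud mapping referred to above is, in principle, a pinhole back-projection of depth-annotated pixels into camera-frame 3D coordinates. A minimal sketch follows; the function name and intrinsic parameters are illustrative assumptions, not details from the paper:

```python
import numpy as np

def pixel_to_point(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a 3D camera-frame point
    using the standard pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics for a 640x480 RGB-D camera (not from the paper)
fx = fy = 525.0
cx, cy = 320.0, 240.0

# A pixel at (400, 300) observed at 1.2 m depth
p = pixel_to_point(400, 300, 1.2, fx, fy, cx, cy)
```

Applied per pixel over a depth image, this yields the point cloud from which 3D spatial prompts can be synthesized.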