Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet executing complex fine manipulation tasks with high speed and precision remains challenging. Although existing VLM-based methods excel at high-level planning, they struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that enables robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: a task memory structure, 2D topology graphs, and 3D spatial networks, achieving high-precision spatial-semantic fusion. These three components collectively accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This enables the VLM to dynamically adjust its guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, significantly advancing robot intelligence for precision tasks.
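The interplay of the three data structures and the step-wise correction loop described above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: all class names, field names, and the `vlm_query` callable are assumptions standing in for the actual VLM interaction mechanism.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the three data structures named in the abstract.
# All names here are illustrative assumptions, not taken from the paper.

@dataclass
class TaskMemory:
    """Task memory structure: accumulates sub-actions and their outcomes."""
    steps: list = field(default_factory=list)

    def record(self, sub_action, success):
        self.steps.append({"action": sub_action, "success": success})

    def failed_steps(self):
        return [s for s in self.steps if not s["success"]]

@dataclass
class SceneState:
    """2D topology graph (object adjacency) plus 3D spatial network (positions)."""
    topology: dict = field(default_factory=dict)   # object -> set of adjacent objects
    positions: dict = field(default_factory=dict)  # object -> (x, y, z) coordinates

def progressive_plan(task, memory, scene, vlm_query):
    """Execute pre-decomposed sub-actions with one step-wise correction retry.

    `vlm_query` stands in for the VLM call; here it is any callable mapping
    (sub_action, scene, memory) -> bool success, so it can be stubbed for tests.
    """
    results = []
    for sub in task["sub_actions"]:  # assumed already decomposed by the VLM
        success = vlm_query(sub, scene, memory)
        memory.record(sub, success)
        if not success:
            # Error correction: re-query with the failure now visible in memory.
            success = vlm_query(sub, scene, memory)
            memory.record(sub, success)
        results.append(success)
    return all(results)
```

The key design point the sketch captures is that the memory, topology, and spatial state are passed into every VLM query, so a retry after a failed sub-action sees richer context than the first attempt.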