Scaffolding Dexterous Manipulation with Vision-Language Models

Dexterous robotic hands are essential for complex manipulation tasks, yet they remain difficult to train because demonstrations are hard to collect and the control space is high-dimensional. Reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, but it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Contemporary work in dexterous manipulation has therefore often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies, and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge, even though the precise details of an explicit reference trajectory are often unnecessary because RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. We then train a low-level residual RL policy in simulation to track these coarse trajectories, or "scaffolds", with high fidelity. Across a number of simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method learns robust dexterous manipulation policies. Moreover, we show that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.

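The scaffold-then-refine idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation scheme, the residual parameterization, or the reward shape, so everything below is an assumption made for illustration: the function names `interpolate_waypoints`, `residual_target`, and `tracking_reward`, the linear interpolation, the `scale` bound on the residual, and the exponential reward with its `sharpness` constant are all hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def interpolate_waypoints(waypoints, steps):
    """Linearly interpolate coarse (K, 3) waypoints into a dense (steps, 3) path.

    Assumption: the VLM returns a handful of 3D waypoints; linear
    interpolation is one simple way to turn them into a per-step reference.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    knots = np.linspace(0.0, 1.0, len(waypoints))  # waypoints spaced uniformly in time
    query = np.linspace(0.0, 1.0, steps)
    return np.stack(
        [np.interp(query, knots, waypoints[:, d]) for d in range(3)], axis=1
    )


def residual_target(reference, residual, scale=0.05):
    """Reference position plus a bounded correction from the policy.

    Assumption: the policy outputs a residual in [-1, 1]^3 that is clipped
    and scaled, so the commanded target stays near the scaffold.
    """
    residual = np.clip(np.asarray(residual, dtype=float), -1.0, 1.0)
    return np.asarray(reference, dtype=float) + scale * residual


def tracking_reward(object_pos, object_ref, sharpness=10.0):
    """Dense, task-agnostic reward in (0, 1]; 1.0 when the object is on the reference."""
    diff = np.asarray(object_pos, dtype=float) - np.asarray(object_ref, dtype=float)
    return float(np.exp(-sharpness * np.linalg.norm(diff)))


# Hypothetical wrist waypoints (meters), standing in for VLM output.
wrist_path = interpolate_waypoints(
    [[0.0, 0.0, 0.0], [0.2, 0.0, 0.1], [0.2, 0.3, 0.1]], steps=5
)
target = residual_target(wrist_path[1], residual=[2.0, -2.0, 0.5])
reward = tracking_reward(object_pos=[0.0, 0.0, 0.0], object_ref=[0.0, 0.0, 0.0])
```

In this sketch the residual is bounded so that the policy can only adjust the scaffold locally, which reflects the abstract's claim that the coarse trajectory guides exploration while RL refines the motion. The reward depends only on the distance between the object and its reference pose, so it needs no task-specific terms.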