Acquiring manipulation skills from language instructions remains an open challenge. Vision-language models have recently made significant progress in teaching robots such skills, but their performance is restricted to a narrow range of simple tasks. In this paper, we propose that vision-language models can provide a superior source of rewards for agents. Our method decomposes complex tasks into simpler sub-goals, which enables better task comprehension, and uses sparse failure guidance to avoid potential failures. Experiments show that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, it achieves a $5.4\times$ higher average success rate than the best baseline, RoboCLIP, across a series of manipulation tasks, and it shows a comprehensive understanding of a wide range of robotic manipulation tasks.