Despite significant progress in robotics and embodied AI in recent years, deploying robots for long-horizon tasks remains a major challenge. Most prior work follows an open-loop philosophy and lacks real-time feedback, which leads to error accumulation and poor robustness. A handful of approaches have attempted to establish feedback mechanisms using pixel-level differences or pre-trained visual representations, yet their efficacy and adaptability have proven limited. Inspired by classic closed-loop control systems, we propose CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. CLOVER consists of three components: a text-conditioned video diffusion model that generates visual plans as reference inputs, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions based on feedback and initiates replanning as needed. Our framework shows notable improvements in real-world robotic tasks and achieves state-of-the-art performance on the CALVIN benchmark, improving by 8% over previous open-loop counterparts. Code and checkpoints are maintained at https://github.com/OpenDriveLab/CLOVER.
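To make the closed-loop idea concrete, the following is a minimal toy sketch of the three-part structure described above (planner, measurable error, feedback controller with replanning). It is not the CLOVER implementation: the state is a 2D point, `embed` is an identity map standing in for a learned embedding, `plan_subgoals` is a linear interpolation standing in for the video diffusion planner, and all names, thresholds, and the proportional gain are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Optional, Tuple

Vec = Tuple[float, float]


def embed(obs: Vec) -> Vec:
    """Hypothetical encoder; identity here, a learned embedding in practice."""
    return obs


def plan_subgoals(start: Vec, goal: Vec, spacing: float = 0.5) -> List[Vec]:
    """Toy stand-in for a visual planner: evenly spaced sub-goals toward the goal."""
    n = max(1, math.ceil(math.dist(start, goal) / spacing))
    return [
        (start[0] + (goal[0] - start[0]) * k / n,
         start[1] + (goal[1] - start[1]) * k / n)
        for k in range(1, n + 1)
    ]


def closed_loop_rollout(
    start: Vec,
    goal: Vec,
    disturb: Optional[Callable[[int, Vec], Vec]] = None,
    reach_tol: float = 0.05,   # assumed: sub-goal counts as reached below this error
    replan_tol: float = 1.0,   # assumed: plan counts as infeasible above this error
    gain: float = 0.5,         # assumed proportional feedback gain
    max_steps: int = 200,
) -> Tuple[Vec, int, bool]:
    """Return (final_state, number_of_replans, goal_reached)."""
    state, plan, replans, actions = start, plan_subgoals(start, goal), 0, 0
    for _ in range(max_steps):
        if not plan:
            return state, replans, True
        # Error is measured in the embedding space, not on raw observations.
        err = math.dist(embed(state), embed(plan[0]))
        if err < reach_tol:
            plan.pop(0)                         # sub-goal reached, move to the next
            continue
        if err > replan_tol:
            plan = plan_subgoals(state, goal)   # deviation too large, replan
            replans += 1
            continue
        # Feedback action: move a fraction of the way toward the current sub-goal.
        target = plan[0]
        state = (state[0] + gain * (target[0] - state[0]),
                 state[1] + gain * (target[1] - state[1]))
        actions += 1
        if disturb is not None:
            state = disturb(actions, state)     # optional external perturbation
    return state, replans, False
```

In this sketch, an undisturbed rollout follows the initial plan without replanning, while a large external push (for example, a `disturb` callback that shifts the state once) drives the measured error above `replan_tol` and triggers a single replan from the perturbed state. An open-loop executor replaying a fixed action sequence would have no way to detect or correct such a deviation.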