With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, most existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and, consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle shows that a closed-loop system whose internal model incorporates external input signals can accurately track the reference input and effectively reject disturbances. We propose GEVRM, a novel closed-loop VLA method that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM produces highly expressive future visual planning goals. In parallel, we evaluate perturbations by simulating responses, termed internal embeddings, which are optimized through prototype contrastive learning. This allows the model to implicitly infer and distinguish perturbations arising from the external environment. The proposed GEVRM achieves state-of-the-art performance on both standard and perturbed CALVIN benchmarks and shows significant improvements on realistic robot tasks.
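The abstract does not give the exact loss used to optimize the internal embeddings, but a standard prototype contrastive objective (InfoNCE over learned prototypes) can be sketched as follows. The function name, the temperature `tau`, and the toy shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prototype_contrastive_loss(z, prototypes, labels, tau=0.1):
    """Generic InfoNCE-style prototype contrastive loss (illustrative sketch,
    not the paper's exact formulation): each embedding is pulled toward its
    assigned prototype and pushed away from the others."""
    # L2-normalize so the dot product becomes cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = z @ p.T / tau                       # (batch, num_prototypes)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative log-likelihood of each embedding's assigned prototype
    return -log_prob[np.arange(len(labels)), labels].mean()

# Toy example: 4 internal embeddings, 2 perturbation prototypes
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
prototypes = rng.normal(size=(2, 8))
labels = np.array([0, 0, 1, 1])
loss = prototype_contrastive_loss(z, prototypes, labels)
```

Minimizing such a loss clusters embeddings of similarly perturbed states around shared prototypes, which is one way a model could implicitly distinguish external perturbations as the abstract describes.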