Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community are open Vision-Language-Action (VLA) models, which showcase strong performance across a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain (OOD) scenarios. This is potentially caused by limited variations in the training data and/or by catastrophic forgetting, leading to domain limitations in the vision foundation models. We further examine OpenVLA, which uses two pre-trained vision foundation models and would therefore be expected to generalize to out-of-domain settings. However, we demonstrate catastrophic forgetting of DINO-v2 within OpenVLA through its failure on a depth-regression task. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires adapting the visual backbones during its initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting, respectively, in visual OOD tasks.
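The abstract does not spell out the merging rule behind the gradual backbone reversal. As an illustration only, the sketch below assumes the reversal is a linear interpolation between the task-adapted backbone weights and the original pretrained weights, with the interpolation coefficient annealed toward the pretrained side over training; the names (`merge_state_dicts`, `num_steps`, the toy backbone) and the linear schedule are hypothetical, not the paper's exact procedure.

```python
import copy
import torch
import torch.nn as nn

def merge_state_dicts(adapted, pretrained, alpha):
    """Convex combination of two state dicts with identical keys:
    alpha = 0 keeps the task-adapted weights, alpha = 1 fully
    restores the original pretrained backbone."""
    return {k: (1.0 - alpha) * adapted[k] + alpha * pretrained[k]
            for k in adapted}

# Toy stand-in for a vision backbone (e.g. DINO-v2); sizes are arbitrary.
backbone = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
pretrained_state = copy.deepcopy(backbone.state_dict())

# ... fine-tuning on robot data would modify `backbone` here ...

# Gradual reversal: step the backbone back toward its pretrained
# weights over the course of training (linear schedule assumed).
adapted_state = copy.deepcopy(backbone.state_dict())
num_steps = 10
for step in range(1, num_steps + 1):
    alpha = step / num_steps  # anneals 0 -> 1
    backbone.load_state_dict(
        merge_state_dicts(adapted_state, pretrained_state, alpha))
```

In practice the model would continue training between merge steps, so the action head can adapt to the progressively restored visual features; the sketch isolates only the weight-merging mechanics.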