Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, improving the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is available for re-implementation at https://github.com/liufanfanlff/RoboUniview
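The decoupling the abstract describes (per-view feature extraction → a camera-agnostic unified representation → an action head) can be sketched at a high level as follows. This is a minimal illustrative sketch only, not the paper's implementation: the pooling-based feature extractor, the extrinsics-weighted fusion rule, and the 7-DoF linear action decoder are all hypothetical stand-ins for the learned components.

```python
import numpy as np

def extract_features(view: np.ndarray) -> np.ndarray:
    """Per-view feature extractor (hypothetical stand-in for a vision backbone).

    Pools an (H, W, C) image into a (C,) feature vector.
    """
    return view.mean(axis=(0, 1))

def unify_views(views: list[np.ndarray], extrinsics: list[np.ndarray]) -> np.ndarray:
    """Fuse per-view features into one camera-agnostic representation.

    Illustrative fusion: each view's features are weighted by a simple
    function of its camera extrinsics, so the output no longer depends on
    any single platform's camera placement. The real method learns this
    unification; the weighting here is purely a placeholder.
    """
    feats = np.stack([extract_features(v) for v in views])      # (N, C)
    weights = np.array([1.0 / (1.0 + np.linalg.norm(e)) for e in extrinsics])
    weights = weights / weights.sum()
    return weights @ feats                                       # (C,)

def action_head(unified: np.ndarray) -> np.ndarray:
    """Hypothetical action decoder: unified representation -> 7-DoF action
    (xyz translation, rpy rotation, gripper), here a fixed random projection."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((7, unified.shape[0]))
    return W @ unified
```

The point of the structure is that `action_head` never sees raw images or camera parameters, only the unified representation, which is what lets the same action learner transfer across platforms with different camera setups.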