The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous works that directly repurpose VLMs for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement brought by diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% in absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).