The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous works that directly repurpose VLMs for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement brought by diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% in absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).