Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning as a way to enhance open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when generated reasoning is not used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pre-trained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pre-trained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both fast-slow reasoning and manipulation task performance.
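To make the architecture concrete, the following is a minimal, hypothetical sketch of the fast-slow routing idea described above: a generalist branch for slow reasoning, a domain-expert branch for fast motion decomposition, and an action expert conditioned on the resulting motion tokens. All module names, dimensions, and the routing logic are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and routing are assumptions,
# not the released MoTVLA implementation.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """One transformer block; generalist and domain expert keep separate copies."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


class MoTVLASketch(nn.Module):
    """Toy mixture-of-transformers VLA: a generalist branch (slow reasoning),
    a domain-expert branch (fast motion decomposition), and an action head
    conditioned on the resulting motion tokens."""

    def __init__(self, dim: int = 256, heads: int = 4, action_dim: int = 7):
        super().__init__()
        self.generalist = ExpertBlock(dim, heads)     # pre-trained VLM branch (slow reasoning)
        self.domain_expert = ExpertBlock(dim, heads)  # shares the VLM token stream (fast reasoning)
        self.action_expert = nn.Sequential(           # policy head over motion-conditioned tokens
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, vis_lang_tokens: torch.Tensor, fast: bool = True) -> torch.Tensor:
        # Route tokens through the fast domain expert or the slow generalist.
        if fast:
            motion_tokens = self.domain_expert(vis_lang_tokens)
        else:
            motion_tokens = self.generalist(vis_lang_tokens)
        # Condition the action expert on the (decomposed) motion representation.
        return self.action_expert(motion_tokens.mean(dim=1))


# Usage: a batch of 2 sequences of 16 visual-language tokens, 256-dim embeddings.
policy = MoTVLASketch()
actions = policy(torch.randn(2, 16, 256), fast=True)
print(actions.shape)  # torch.Size([2, 7])
```

In this toy version, routing is a simple boolean switch; the paper's point is that the fast branch avoids the latency of full generative reasoning while still providing language-grounded motion conditioning for the policy.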