Integrating visual-language instructions into visuomotor policies is gaining momentum in robot learning as a way to enhance open-world generalization. Despite promising advances, existing approaches face two challenges: limited language steerability when generated reasoning is not used as a condition, or significant inference latency when reasoning is incorporated. In this work, we introduce MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA) model that integrates fast-slow unified reasoning with behavior policy learning. MoTVLA preserves the general intelligence of pre-trained VLMs (serving as the generalist) for tasks such as perception, scene understanding, and semantic planning, while incorporating a domain expert, a second transformer that shares knowledge with the pre-trained VLM, to generate domain-specific fast reasoning (e.g., robot motion decomposition), thereby improving policy execution efficiency. By conditioning the action expert on decomposed motion instructions, MoTVLA can learn diverse behaviors and substantially improve language steerability. Extensive evaluations across natural language processing benchmarks, robotic simulation environments, and real-world experiments confirm the superiority of MoTVLA in both fast-slow reasoning and manipulation task performance.
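To make the architecture concrete, the following is a minimal, hypothetical sketch of the fast-slow routing idea described above: a generalist branch for slow reasoning, a domain-expert branch for fast motion decomposition, and an action expert conditioned on the resulting motion tokens. All module names, dimensions, and the routing logic are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and routing are assumptions,
# not the released MoTVLA implementation.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """One transformer block; generalist and domain expert keep separate copies."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


class MoTVLASketch(nn.Module):
    """Toy mixture-of-transformers VLA: a generalist branch (slow reasoning),
    a domain-expert branch (fast motion decomposition), and an action head
    conditioned on the resulting motion tokens."""

    def __init__(self, dim: int = 256, heads: int = 4, action_dim: int = 7):
        super().__init__()
        self.generalist = ExpertBlock(dim, heads)     # pre-trained VLM branch (slow reasoning)
        self.domain_expert = ExpertBlock(dim, heads)  # shares the VLM token stream (fast reasoning)
        self.action_expert = nn.Sequential(           # policy head over motion-conditioned tokens
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, vis_lang_tokens: torch.Tensor, fast: bool = True) -> torch.Tensor:
        # Route tokens through the fast domain expert or the slow generalist.
        if fast:
            motion_tokens = self.domain_expert(vis_lang_tokens)
        else:
            motion_tokens = self.generalist(vis_lang_tokens)
        # Condition the action expert on the (decomposed) motion representation.
        return self.action_expert(motion_tokens.mean(dim=1))


# Usage: a batch of 2 sequences of 16 visual-language tokens, 256-dim embeddings.
policy = MoTVLASketch()
actions = policy(torch.randn(2, 16, 256), fast=True)
print(actions.shape)  # torch.Size([2, 7])
```

In this toy version, routing is a simple boolean switch; the paper's point is that the fast branch avoids the latency of full generative reasoning while still providing language-grounded motion conditioning for the policy.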