The fundamental premise of Vision-Language-Action (VLA) models is to harness the broad general capabilities of pre-trained Vision-Language Models (VLMs) for generalized embodied intelligence. However, standard robotic fine-tuning inevitably disrupts the pre-trained feature space, causing "catastrophic forgetting" that compromises the very general visual understanding we aim to leverage. To exploit the uncorrupted general capabilities of VLMs for robotic tasks, we propose TwinBrainVLA, which coordinates two isomorphic VLM pathways: a frozen generalist (the "Left Brain") and a trainable specialist (the "Right Brain"). Our architecture employs an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism that lets the Right Brain dynamically query intact semantic knowledge from the Left Brain and fuse it with proprioceptive states. The fused representation conditions a flow-matching action expert for precise continuous control. Empirical results on the SimplerEnv and RoboCasa benchmarks demonstrate that, by explicitly retaining general capabilities, TwinBrainVLA achieves substantial performance gains over baseline models on complex manipulation tasks.
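To make the asymmetric fusion idea concrete, the following is a minimal, purely illustrative sketch of one-directional cross-attention: trainable "Right Brain" tokens query frozen "Left Brain" features, and the result is combined with a proprioceptive embedding to condition an action head. All function names, dimensions, and the single-head attention form are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    # scaled dot-product attention: specialist queries attend to generalist tokens
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 8
left_tokens = rng.normal(size=(16, d))   # frozen generalist (Left Brain) features
right_tokens = rng.normal(size=(4, d))   # trainable specialist (Right Brain) tokens
proprio = rng.normal(size=(d,))          # proprioceptive state embedding

# Asymmetric flow: only the Right Brain queries the Left Brain, never the reverse,
# so the frozen pathway's representations stay intact.
fused = cross_attend(right_tokens, left_tokens, left_tokens)

# Pool and concatenate with proprioception to form the action-expert condition.
condition = np.concatenate([fused.mean(axis=0), proprio])
print(condition.shape)  # (16,)
```

In a real system the attention would be multi-headed with learned projections, and the condition vector would feed a flow-matching action expert; the sketch only shows why the asymmetry leaves the frozen pathway untouched.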