Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual ones. However, because most public datasets focus on single-arm demonstrations, adapting VLAs to bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art $\pi_0$ model, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation that leverages public single-arm data.
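To make the composition idea concrete, the following is a minimal, purely illustrative PyTorch sketch (not the paper's implementation) of how two weight-sharing copies of a single-arm policy could be applied per arm and their predicted actions concatenated into a bimanual action. All names and interfaces here (SingleArmVLA, TwinVLASketch, obs_dim, action_dim) are hypothetical, and the cross-arm coordination mechanism that TwinVLA presumably adds is omitted.

```python
import torch
import torch.nn as nn


class SingleArmVLA(nn.Module):
    """Hypothetical stand-in for a pretrained single-arm VLA policy."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Placeholder backbone; a real VLA would fuse vision and language inputs.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.backbone(obs)


class TwinVLASketch(nn.Module):
    """Composes two weight-sharing copies of a single-arm policy for bimanual control.

    Illustrative only: the coordination between the two policy copies
    (e.g., joint attention across arms) is not modeled here.
    """

    def __init__(self, single_arm: SingleArmVLA):
        super().__init__()
        # One shared module applied per arm, i.e., "two copies" with tied weights.
        self.policy = single_arm

    def forward(self, left_obs: torch.Tensor, right_obs: torch.Tensor) -> torch.Tensor:
        left_action = self.policy(left_obs)
        right_action = self.policy(right_obs)
        # Concatenate per-arm actions into one bimanual action vector.
        return torch.cat([left_action, right_action], dim=-1)


if __name__ == "__main__":
    arm_policy = SingleArmVLA(obs_dim=32, action_dim=7)
    twin = TwinVLASketch(arm_policy)
    actions = twin(torch.randn(1, 32), torch.randn(1, 32))
    print(actions.shape)  # torch.Size([1, 14])
```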