Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has evolved rapidly, incorporating additional spatial priors and diverse architectural innovations. However, these advances are often accompanied by varying training recipes and implementation details, which makes it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pre-training. SimVLA also achieves real-robot performance on par with pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA
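The abstract describes a design that decouples perception from control by pairing a standard vision-language backbone with a lightweight action head. The sketch below is only a rough illustration of what such a decoupled interface could look like; the module names, feature dimensions, action-chunk length, and the toy backbone are assumptions made for illustration and are not taken from SimVLA.

```python
# Hypothetical sketch of a "backbone + lightweight action head" VLA policy.
# None of these sizes or names come from the SimVLA paper; they are placeholders.
import torch
import torch.nn as nn


class LightweightActionHead(nn.Module):
    """Small MLP mapping pooled vision-language features to an action chunk."""

    def __init__(self, feat_dim: int, action_dim: int, chunk_len: int):
        super().__init__()
        self.action_dim = action_dim
        self.chunk_len = chunk_len
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.GELU(),
            nn.Linear(512, chunk_len * action_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) pooled features from the perception backbone
        return self.mlp(feats).view(-1, self.chunk_len, self.action_dim)


class ToyBackbone(nn.Module):
    """Stand-in for a pretrained vision-language encoder (illustration only)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(3 * 32 * 32, feat_dim)
        self.txt_proj = nn.Linear(64, feat_dim)

    def forward(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Fuse image and language features into one pooled embedding.
        return self.img_proj(images.flatten(1)) + self.txt_proj(text_emb)


class DecoupledVLAPolicy(nn.Module):
    """Perception (backbone) and control (head) kept as separate modules."""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.backbone = backbone
        self.head = LightweightActionHead(feat_dim, action_dim, chunk_len)

    def forward(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images, text_emb)   # (batch, feat_dim)
        return self.head(feats)                   # (batch, chunk_len, action_dim)


if __name__ == "__main__":
    policy = DecoupledVLAPolicy(ToyBackbone(256), feat_dim=256)
    actions = policy(torch.randn(2, 3, 32, 32), torch.randn(2, 64))
    print(actions.shape)  # torch.Size([2, 8, 7])
```

In this hypothetical layout, the backbone can be swapped or fine-tuned independently of the action head, which is one way to keep the contributions of perception and control attributable, in the spirit of the baseline the abstract describes.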