Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: existing approaches vary substantially in architecture, training data, embodiment configuration, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Under unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with a minimal design is already sufficient for strong performance, without additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20\% on the public real-world RoboChallenge benchmark. We expect StarVLA-$α$ to serve as a solid starting point for future VLA research. Code will be released at https://github.com/starVLA/starVLA.