Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow matching, we ablate key design decisions under matched conditions and evaluate them in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: we show that a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: we find that naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, underscoring the fragility of indiscriminate data scaling. (3) Training regularization: we observe that intuitive strategies, such as sensory dropout and multi-stage fine-tuning, do not consistently improve performance at scale. Together, these findings challenge some common assumptions about embodied scaling and provide practical guidance for training large-scale VLA policies from diverse robotic data. Project website: https://research.beingbeyond.com/rethink_vla
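
To make the idea of an EEF-relative action representation concrete, the following is a minimal illustrative sketch, not the implementation used in this study. It assumes that an action is the target end-effector pose expressed in the frame of the current end-effector pose, with each pose given as a rotation matrix and a position vector. The function names (`eef_relative_action`, `rot_z`, and the small matrix helpers) are hypothetical and introduced only for this example.

```python
import math

def transpose(R):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [[R[j][i] for j in range(3)] for i in range(3)]

def matmul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def eef_relative_action(ref_pose, target_pose):
    """Express target_pose in the frame of ref_pose.

    Each pose is a tuple (R, p): a 3x3 rotation matrix and a 3-vector
    position, both given in the robot's own base frame (an assumption
    of this sketch). Returns (delta_R, delta_p).
    """
    R_ref, p_ref = ref_pose
    R_tgt, p_tgt = target_pose
    R_ref_T = transpose(R_ref)  # inverse of a rotation matrix
    delta_p = matvec(R_ref_T, [t - r for t, r in zip(p_tgt, p_ref)])
    delta_R = matmul(R_ref_T, R_tgt)
    return delta_R, delta_p

# Two hypothetical robots execute the same gripper motion
# (10 cm forward, 0.2 rad yaw) from differently placed base frames.
robot_a = eef_relative_action(
    (rot_z(0.0), [0.0, 0.0, 0.0]),
    (rot_z(0.2), [0.1, 0.0, 0.0]),
)
robot_b = eef_relative_action(
    (rot_z(math.pi / 2), [1.0, 2.0, 3.0]),
    (rot_z(math.pi / 2 + 0.2), [1.0, 2.1, 3.0]),
)
```

Under these assumptions, `robot_a` and `robot_b` agree up to floating-point error even though their absolute poses differ. This invariance to base-frame placement is one plausible reason why a unified relative representation could ease cross-embodiment transfer.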
