Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace the specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine-learning workflow, centered on models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning to and from real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/
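To make the "unified interface for simulated and physical robots" idea concrete, the following is a minimal Python sketch of one way such a sim/real abstraction can look. All names here (RobotEnv, SimEnv, RealEnv, and their methods) are illustrative assumptions, not the actual RCS API.

```python
# Hypothetical sketch of a backend-agnostic robot interface, assuming a
# gym-style reset/step loop. Names are illustrative, not the RCS API.
from abc import ABC, abstractmethod
import numpy as np


class RobotEnv(ABC):
    """Policies see the same interface whether actions run in
    simulation or on real hardware, easing sim-to-real transfer."""

    @abstractmethod
    def reset(self) -> dict:
        """Return an initial observation (e.g. camera images, proprioception)."""

    @abstractmethod
    def step(self, action: np.ndarray) -> dict:
        """Apply an action and return the next observation."""


class SimEnv(RobotEnv):
    """Simulation backend: steps a physics engine (stubbed here)."""

    def reset(self) -> dict:
        return {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}

    def step(self, action: np.ndarray) -> dict:
        # A real implementation would advance the simulator here.
        return {"rgb": np.zeros((224, 224, 3), dtype=np.uint8)}


class RealEnv(RobotEnv):
    """Hardware backend: would forward commands to a physical robot."""

    def reset(self) -> dict:
        raise NotImplementedError("connect to robot hardware here")

    def step(self, action: np.ndarray) -> dict:
        raise NotImplementedError("send command to robot hardware here")


# Swapping backends leaves the policy loop unchanged:
env: RobotEnv = SimEnv()          # or RealEnv() at deployment time
obs = env.reset()
obs = env.step(np.zeros(7))       # e.g. a 7-DoF action vector
```

Under this design, transferring a policy from large-scale simulation training to a physical robot amounts to instantiating a different backend, with the policy and evaluation code untouched.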