We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot that maintains generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is augmented with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and target-selection precision. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and gains from RL alignment in success rate, robustness, and long-horizon efficiency.