A capable Vision-Language-Action (VLA) foundation model holds great potential for robotic manipulation, and is expected to generalize faithfully across tasks and platforms while remaining cost-efficient (e.g., in the data and GPU hours required for adaptation). To this end, we develop LingBot-VLA using around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. In a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model clearly outperforms competitors, demonstrating strong performance and broad generalizability. We have also built an efficient codebase that delivers a throughput of 261 samples per second with an 8-GPU training setup, a 1.5$\times$ to 2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. These features make our model well suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.