While Vision-Language-Action (VLA) models generalize well across a variety of tasks, real-world deployment of robotic policies still requires large-scale, high-quality human expert demonstrations. Collecting this data through human teleoperation demands continuous operator attention, which makes it costly and hard to scale. To address this, we propose Genie Centurion (GCENT), a scalable and general data collection paradigm based on human rewind-and-refine guidance that enables robots to learn interactively during deployment. GCENT starts from an imperfect policy and improves it over time. When a robot fails during execution, GCENT uses a rewind mechanism to revert the robot to a previous state, after which a teleoperator provides corrective demonstrations to refine the policy. The framework supports a one-human-to-many-robots supervision scheme through a Task Sentinel module, which autonomously predicts task success and solicits human intervention when necessary. Empirical results show that GCENT achieves up to 40% higher task success rates than state-of-the-art data collection methods, and reaches comparable performance using less than half the data in long-horizon and precise tasks. We also quantify the data yield-to-effort ratio in multi-robot scenarios, demonstrating GCENT's potential for scalable and cost-efficient robot policy training in real-world environments.
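The rewind-and-refine loop described in the abstract can be illustrated with a toy sketch. This is not the GCENT implementation: the names `RewindBuffer`, `run_episode`, `policy`, `sentinel`, and `human_step` are hypothetical, the state is reduced to an integer step index, and the "sentinel" is a plain callable standing in for a learned success predictor. It only shows the control flow: run the policy, check each step, rewind on failure, and record the human's corrective steps as new training data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = int  # toy assumption: a state is just the index of the current task step


@dataclass
class RewindBuffer:
    """Keeps visited states so execution can be reverted after a failure."""
    history: List[State] = field(default_factory=list)

    def push(self, state: State) -> None:
        self.history.append(state)

    def rewind(self, steps: int) -> State:
        # Drop the last `steps` states (always keeping at least one)
        # and return the state we land on.
        keep = max(1, len(self.history) - steps)
        self.history = self.history[:keep]
        return self.history[-1]


def run_episode(
    policy: Callable[[State], State],
    sentinel: Callable[[State, State], bool],
    human_step: Callable[[State], State],
    goal: State,
    rewind_steps: int = 2,
    max_steps: int = 100,
) -> Tuple[bool, List[Tuple[State, State]]]:
    """Run one episode; return (success, corrective transitions collected)."""
    buffer = RewindBuffer()
    state = 0
    buffer.push(state)
    corrections: List[Tuple[State, State]] = []
    for _ in range(max_steps):
        if state == goal:
            return True, corrections
        proposed = policy(state)
        if sentinel(state, proposed):  # step judged successful: keep going
            state = proposed
            buffer.push(state)
            continue
        # Failure flagged: rewind, then let the human demonstrate
        # from the rewound state until past the point of failure.
        failed_at = state
        state = buffer.rewind(rewind_steps)
        while state <= failed_at:
            nxt = human_step(state)
            corrections.append((state, nxt))
            state = nxt
            buffer.push(state)
    return state == goal, corrections


# Toy usage: the policy gets stuck at step 3; the human always advances by one.
success, corrections = run_episode(
    policy=lambda s: s if s == 3 else s + 1,
    sentinel=lambda s, p: p == s + 1,
    human_step=lambda s: s + 1,
    goal=5,
)
print(success, corrections)  # True [(1, 2), (2, 3), (3, 4)]
```

In this sketch the human is only involved for the three corrective transitions around the failure, while the policy handles the remaining steps on its own. That is the intuition behind supervising several robots with one operator, though the actual system's state representation, rewind policy, and success predictor are not specified here.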