The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. Vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, but VLMs suffer from semantic-spatial misalignment and VGMs from physical hallucinations. To bridge this gap, we introduce RoboEvolve, a framework that couples a VLM planner and a VGM simulator in a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve uses a cognitively inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantically controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) is effective, improving base planners by 30 absolute points and simulator success by 48% on average; (II) is highly data-efficient, surpassing fully supervised baselines with only 500 unlabeled seeds, a 50x reduction; and (III) supports robust continual learning without catastrophic forgetting.