RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds, a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.
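The dual-phase loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so every name here (`ToyPlanner`, `rollout`, `co_evolve`), the scalar "skill" model, the thresholds, and the learning rates are assumptions chosen only to show the control flow of daytime exploration, nighttime consolidation on near-misses, and a curriculum that advances once the success rate is high enough.

```python
import random
from dataclasses import dataclass

# All constants below are illustrative assumptions, not values from the paper.
SUCCESS = 0.6      # score at or above this counts as a success
NEAR_MISS = 0.45   # score in [NEAR_MISS, SUCCESS) counts as a near-miss
MAX_LEVEL = 3      # highest curriculum level in this toy setup


@dataclass
class ToyPlanner:
    """Stand-in for a planner: a single competence value in [0, 1]."""
    skill: float = 0.5

    def update(self, reward: float, lr: float) -> None:
        # Move skill toward 1.0 in proportion to the reward signal.
        self.skill = min(1.0, self.skill + lr * reward * (1.0 - self.skill))


def rollout(planner: ToyPlanner, level: int, rng: random.Random) -> float:
    """Stand-in for a simulated rollout scored by a reward in [0, 1]."""
    raw = planner.skill - 0.1 * level + rng.uniform(-0.2, 0.2)
    return max(0.0, min(1.0, raw))


def co_evolve(num_cycles: int = 20, seeds_per_cycle: int = 50, seed: int = 0):
    rng = random.Random(seed)
    planner = ToyPlanner()
    level = 0
    history = []  # (curriculum level, daytime success rate) per cycle
    for _ in range(num_cycles):
        near_misses = []
        successes = 0
        # Daytime: explore and learn directly from successful rollouts.
        for _ in range(seeds_per_cycle):
            score = rollout(planner, level, rng)
            if score >= SUCCESS:
                successes += 1
                planner.update(score, lr=0.02)
            elif score >= NEAR_MISS:
                near_misses.append(score)
        # Nighttime: replay the stored near-misses with a smaller step.
        for score in near_misses:
            planner.update(score, lr=0.01)
        rate = successes / seeds_per_cycle
        history.append((level, rate))
        # Curriculum: move to a harder level once the current one is mostly solved.
        if rate >= 0.8 and level < MAX_LEVEL:
            level += 1
    return planner, history


if __name__ == "__main__":
    planner, history = co_evolve()
    print(f"final skill: {planner.skill:.3f}, final level: {history[-1][0]}")
```

In this sketch the simulator is a fixed scoring function; in the framework described by the abstract the simulator also improves, which this toy omits for brevity.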

