Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.