Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source of expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. Keeping the pretrained diffusion policy frozen regularizes exploration to remain within safe, human-like behavior manifolds and enables effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, and on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.
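The idea of steering a frozen policy through its initial noise can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not ExpertGen's implementation: `frozen_policy` is a hypothetical stand-in for a pretrained diffusion policy with a deterministic sampler, the noise policy is reduced to a state-independent Gaussian mean `mu`, and plain REINFORCE with a mean-reward baseline is swapped in for whatever reinforcement learning algorithm the paper uses. The reward is sparse (1 on success, 0 otherwise), and only `mu` is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pretrained diffusion policy weights (never updated).
FROZEN_W = rng.normal(size=(2, 2))


def frozen_policy(state, noise, steps=4):
    """Toy deterministic sampler: maps (state, initial noise) to a bounded action."""
    x = noise
    for _ in range(steps):
        # Each toy "denoising" step squashes x into [-1, 1].
        x = np.tanh(FROZEN_W @ x + state)
    return x


def sparse_reward(action, target, tol=0.3):
    """Return 1.0 if the action lands within tol of the target, else 0.0."""
    return float(np.linalg.norm(action - target) < tol)


def train_noise_policy(state, target, iters=200, batch=64, lr=0.5, sigma=1.0):
    """REINFORCE on the mean of a Gaussian over initial noise (assumed setup)."""
    mu = np.zeros(2)
    for _ in range(iters):
        noise = mu + sigma * rng.normal(size=(batch, 2))
        rewards = np.array(
            [sparse_reward(frozen_policy(state, w), target) for w in noise]
        )
        # Mean-reward baseline reduces variance of the gradient estimate.
        advantage = rewards - rewards.mean()
        # Gradient of log N(w; mu, sigma^2 I) w.r.t. mu is (w - mu) / sigma^2.
        grad = (advantage[:, None] * (noise - mu)).mean(axis=0) / sigma**2
        mu = mu + lr * grad
    return mu


state = np.array([0.1, -0.2])
# Pick a target the frozen policy can actually reach from some noise vector.
target = frozen_policy(state, np.array([1.0, -1.0]))
w_snapshot = FROZEN_W.copy()
mu = train_noise_policy(state, target)
```

Because the action is always produced by `frozen_policy`, every explored behavior stays inside the set of actions the pretrained model can generate, which is the regularizing effect the abstract describes. In a realistic setting the noise policy would be a state-conditioned network rather than a single vector.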