Recent advances in foundation models, such as large language models and world models, have greatly enhanced the capabilities of robots, enabling them to perform complex tasks autonomously. However, acquiring large-scale, high-quality training data for robotics remains a challenge: collection often requires substantial manual effort, and the resulting data covers only a limited range of real-world environments. To address this, we propose Compositional Simulation, a hybrid approach that combines classical simulation with neural simulation to generate accurate action-video pairs while maintaining real-world consistency. Our approach uses a closed-loop real-sim-real data augmentation pipeline that leverages a small amount of real-world data to generate diverse, large-scale training datasets covering a broader spectrum of real-world scenarios. We train a neural simulator to transform classical simulation videos into real-world representations, improving the accuracy of policy models trained in real-world environments. Through extensive experiments, we demonstrate that our method significantly reduces the sim2real domain gap, yielding higher success rates in real-world policy model training. Our approach offers a scalable solution for generating robust training data and bridging the gap between simulated and real-world robotics.