Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming. This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms. Prior X-Gen works have developed automated data generation frameworks for static (bimanual) manipulation tasks, augmenting a few human demos in simulation with novel scene configurations to synthesize large-scale datasets. However, these prior works fall short for bimanual mobile manipulation tasks for two major reasons: 1) a mobile base introduces the problem of how to place the robot base to enable downstream manipulation (reachability), and 2) an active camera introduces the problem of how to position the camera to generate data for a visuomotor policy (visibility). To address these challenges, MoMaGen formulates data generation as a constrained optimization problem that satisfies hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes across most existing automated data generation approaches and offers a principled foundation for developing future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and find that it generates significantly more diverse datasets than previous methods. Thanks to this diversity, we also show that the data generated by MoMaGen can be used to train successful imitation learning policies from a single source demo. Furthermore, the trained policy can be fine-tuned with a very small amount of real-world data (40 demos) to be successfully deployed on real robotic hardware. More details are available on our project page: momagen.github.io.
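The hard/soft constraint formulation described above can be illustrated with a minimal sketch: sample candidate base placements, discard those that violate a hard constraint (reachability), then rank the survivors by a soft cost (visibility). All function names, geometry, and thresholds below are hypothetical stand-ins for illustration, not the paper's actual optimization procedure.

```python
# Hypothetical sketch of constrained optimization over base placements:
# hard constraints filter candidates; soft constraints rank survivors.
import random

def is_reachable(base_pose, target):
    # Hard-constraint stand-in: target must lie within an assumed
    # fixed arm reach (1.0 m) of the base.
    dx, dy = target[0] - base_pose[0], target[1] - base_pose[1]
    return (dx * dx + dy * dy) ** 0.5 <= 1.0

def visibility_cost(base_pose, target):
    # Soft-constraint stand-in: prefer poses closer to the target,
    # as a proxy for keeping it in the camera's view (lower is better).
    dx, dy = target[0] - base_pose[0], target[1] - base_pose[1]
    return (dx * dx + dy * dy) ** 0.5

def sample_base_pose(target, n_samples=1000, seed=0):
    rng = random.Random(seed)
    candidates = [
        (target[0] + rng.uniform(-2.0, 2.0),
         target[1] + rng.uniform(-2.0, 2.0))
        for _ in range(n_samples)
    ]
    # Hard constraint: reject infeasible placements outright.
    feasible = [p for p in candidates if is_reachable(p, target)]
    if not feasible:
        return None
    # Soft constraint: among feasible placements, minimize the cost.
    return min(feasible, key=lambda p: visibility_cost(p, target))

pose = sample_base_pose(target=(3.0, 4.0))
```

In practice the paper's formulation would involve full robot kinematics and camera geometry rather than these 2-D proxies; the sketch only shows the structure of the problem: feasibility as a filter, desirability as an objective.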