Enhancing the generalization capability of robot learning so that robots can operate effectively in diverse, unseen scenes is a fundamental and challenging problem. Existing approaches often depend on pretraining with large-scale data collection, which is labor-intensive and time-consuming, or on semantic data augmentation techniques that rest on the impractical assumption of flawless upstream object detection in real-world scenarios. In this work, we propose RoboAug, a novel generative data augmentation framework that substantially reduces reliance on both large-scale pretraining and the perfect-visual-recognition assumption by requiring only the bounding-box annotation of a single image during training. Leveraging this minimal supervision, RoboAug employs pre-trained generative models for precise semantic data augmentation and integrates a plug-and-play region-contrastive loss that encourages models to focus on task-relevant regions, thereby improving generalization and boosting task success rates. We conduct extensive real-world experiments on three robots, namely UR-5e, AgileX, and Tien Kung 2.0, spanning over 35k rollouts. Empirical results demonstrate that RoboAug significantly outperforms state-of-the-art data augmentation baselines. Specifically, when evaluating generalization in unseen scenes featuring diverse combinations of backgrounds, distractors, and lighting conditions, our method achieves substantial gains over the no-augmentation baseline: success rates increase from 0.09 to 0.47 on UR-5e, from 0.16 to 0.60 on AgileX, and from 0.19 to 0.67 on Tien Kung 2.0. These results highlight the superior generalization and effectiveness of RoboAug in real-world manipulation tasks. Our project is available at https://x-roboaug.github.io/.