To address the "reality gap" in Sim-to-Real transfer, this study proposes a diffusion-based framework that minimizes inconsistencies in grasping actions between simulated and real environments. The process begins by training an adversarial supervision layout-to-image diffusion model (ALDM). The ALDM is then used to enhance the simulation environment by rendering it with photorealistic fidelity, which improves training for robotic grasping tasks. Experimental results indicate that the framework outperforms existing models in both success rate and adaptability to new environments by improving the accuracy and reliability of visual grasping actions under a variety of conditions. Specifically, it achieves a 75\% success rate in grasping tasks against plain backgrounds and maintains a 65\% success rate in more complex scenarios. These results demonstrate that the framework is effective at generating controlled image content from text descriptions, identifying object grasp points, and performing zero-shot learning in complex, unseen scenarios.