Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.