Recent advances in imitation learning have shown significant promise for robotic control and embodied intelligence. However, achieving robust generalization across diverse mounted-camera observations remains a critical challenge. In this paper, we introduce a video-based spatial perception framework that leverages 3D spatial representations to address environmental variability, with a focus on lighting changes. Our approach integrates a novel image augmentation technique, AugBlender, with a state-of-the-art monocular depth estimation model trained on internet-scale data. Together, these components form a cohesive system designed to enhance robustness and adaptability in dynamic scenarios. Our results demonstrate that our approach significantly boosts the success rate across diverse camera exposures, where previous models suffer performance collapse. These findings highlight the potential of video-based spatial perception models in advancing robustness for end-to-end robotic learning, paving the way for scalable, low-cost solutions in embodied intelligence.
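The abstract does not specify how AugBlender is implemented. A minimal sketch of one plausible exposure-robust augmentation in this spirit is shown below: the input frame is blended with a randomly re-exposed copy of itself. The function name `augblend`, the gain range, and the blend weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def augblend(image: np.ndarray, rng: np.random.Generator,
             gain_range=(0.4, 1.8)) -> np.ndarray:
    """Blend a frame with a randomly re-exposed copy of itself.

    `image` is float32 in [0, 1] with shape (H, W, 3). The gain range
    and uniform blend weight are placeholder choices for illustration.
    """
    gain = rng.uniform(*gain_range)                 # simulated exposure gain
    reexposed = np.clip(image * gain, 0.0, 1.0)     # over/under-exposed copy
    alpha = rng.uniform(0.0, 1.0)                   # random blend weight
    return (1.0 - alpha) * image + alpha * reexposed

rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3), dtype=np.float32)
augmented = augblend(frame, rng)
```

Applied per frame during training, an augmentation of this kind exposes the policy to a continuum of lighting conditions rather than only the exposures present in the demonstrations.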