Recent work on perceptive locomotion for legged robots has shown promising progress. However, terrain-aware humanoid locomotion remains largely constrained to two paradigms: depth image-based end-to-end learning and elevation map-based methods. The former suffers from limited training efficiency and a significant sim-to-real gap in depth perception, while the latter depends heavily on multiple vision sensors and localization systems, resulting in added latency and reduced robustness. To overcome these challenges, we propose a novel framework that tightly integrates three key components: (1) a Terrain-Aware Locomotion Policy with a Blind Backbone, which leverages pre-trained elevation map-based perception to guide reinforcement learning with minimal visual input; (2) a Multi-Modality Cross-Attention Transformer, which reconstructs structured terrain representations from noisy depth images; and (3) a Realistic Depth Image Synthesis Method, which employs self-occlusion-aware ray casting and noise-aware modeling to synthesize realistic depth observations, reducing terrain reconstruction error by over 30\%. This combination enables efficient policy training with limited data and hardware resources, while preserving the terrain features essential for generalization. We validate our framework on a full-sized humanoid robot, demonstrating agile and adaptive locomotion across diverse and challenging terrains.
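To make component (2) concrete, the following is a minimal sketch of the cross-attention idea in PyTorch: a proprioceptive query token attends over depth-image patch tokens, and the fused token is decoded into a local height-map estimate. All module names, dimensions, and the single-layer design are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: proprioception queries depth tokens via cross-attention
# to reconstruct a local elevation map. Dimensions and names are assumed
# for illustration only.
import torch
import torch.nn as nn

class CrossAttentionTerrainDecoder(nn.Module):
    def __init__(self, patch_dim=128, proprio_dim=48,
                 embed_dim=128, n_heads=4, map_cells=11 * 11):
        super().__init__()
        # Project proprioception into a single query token.
        self.q_proj = nn.Linear(proprio_dim, embed_dim)
        # Project depth patch features into key/value tokens.
        self.kv_proj = nn.Linear(patch_dim, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads,
                                                batch_first=True)
        # Decode the fused token into a flattened height-map estimate.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, map_cells),
        )

    def forward(self, proprio, depth_tokens):
        # proprio: (B, proprio_dim); depth_tokens: (B, N_patches, patch_dim)
        q = self.q_proj(proprio).unsqueeze(1)   # (B, 1, embed_dim)
        kv = self.kv_proj(depth_tokens)         # (B, N, embed_dim)
        fused, _ = self.cross_attn(q, kv, kv)   # query attends over depth
        return self.head(fused.squeeze(1))      # (B, map_cells)

if __name__ == "__main__":
    model = CrossAttentionTerrainDecoder()
    proprio = torch.randn(2, 48)
    depth_tokens = torch.randn(2, 64, 128)
    print(model(proprio, depth_tokens).shape)   # torch.Size([2, 121])
```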
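Similarly, a hedged sketch of the noise-aware half of component (3): starting from clean simulated depth, a generic stereo-style corruption model adds depth-dependent Gaussian noise, drops returns near depth discontinuities (where self-occlusion tends to produce invalid pixels on real sensors), and quantizes. The `corrupt_depth` helper and all parameter values are hypothetical, not the paper's calibrated model.

```python
# Generic noise model for synthetic depth images; parameters are assumed,
# not the paper's calibrated values.
import numpy as np

def corrupt_depth(clean_depth, rng,
                  sigma_scale=0.0015,   # noise grows quadratically with depth
                  dropout_prob=0.02,    # uniform random invalid pixels
                  quant_step=0.002):    # sensor quantization step (meters)
    depth = clean_depth.copy()
    # Depth-dependent Gaussian noise, as in common stereo-camera models.
    depth += rng.normal(0.0, sigma_scale * depth**2)
    # Drop pixels near strong depth discontinuities: self-occlusion edges
    # tend to produce invalid returns on real sensors.
    gy, gx = np.gradient(clean_depth)
    edge_mask = np.hypot(gx, gy) > 0.05
    depth[edge_mask & (rng.random(depth.shape) < 0.5)] = 0.0
    # Uniformly random dropout elsewhere.
    depth[rng.random(depth.shape) < dropout_prob] = 0.0
    # Quantize to mimic finite disparity resolution.
    return np.round(depth / quant_step) * quant_step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.full((60, 80), 1.5)        # flat scene at 1.5 m
    clean[20:40, 30:50] = 0.8             # closer box -> depth edges
    noisy = corrupt_depth(clean, rng)
    print(noisy.shape, float((noisy == 0).mean()))  # dropout fraction
```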