Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type. We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-distance staircase traversal.
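To make the depth-sensor simulation idea concrete, the sketch below degrades a clean simulated depth map with stereo-like artifacts: sub-pixel disparity quantization, depth-dependent Gaussian noise, and matching failures (holes) at depth discontinuities. This is a minimal illustration of the general technique, not the paper's implementation; the function name, intrinsics (`focal`, `baseline`), and all noise parameters are hypothetical.

```python
import numpy as np

def simulate_stereo_depth(depth, focal=387.0, baseline=0.05,
                          edge_hole_prob=0.6, noise_scale=0.005, rng=None):
    """Degrade a clean depth map (meters) with stereo-like artifacts.

    All parameters here are illustrative placeholders, not values
    from the paper. Returns a noisy depth map; 0.0 marks invalid pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = depth.astype(np.float64).copy()

    # 1) Disparity quantization: stereo matchers resolve disparity at
    #    finite sub-pixel resolution, so depth precision falls with range.
    disp = focal * baseline / np.clip(z, 1e-3, None)
    disp = np.round(disp * 8.0) / 8.0      # assume 1/8-pixel resolution
    z = focal * baseline / np.clip(disp, 1e-3, None)

    # 2) Depth-dependent noise: stereo depth error grows roughly as z^2.
    z += rng.normal(0.0, noise_scale, z.shape) * z**2

    # 3) Holes at depth discontinuities, where matching typically fails.
    gy, gx = np.gradient(depth)
    edges = np.hypot(gx, gy) > 0.1
    holes = edges & (rng.random(z.shape) < edge_hole_prob)
    z[holes] = 0.0
    return z
```

In a pipeline like the one described, such a model would be applied to rendered depth during training so the policy (or the distilled student) learns features that are invariant to these corruptions.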