Reliable terrain perception is a critical prerequisite for the deployment of humanoid robots in unstructured, human-centric environments. While traditional systems often rely on manually engineered, single-sensor pipelines, this paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation. A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction fused with a Gated Recurrent Unit (GRU) core for temporal consistency. The architecture integrates multimodal data from an Intel RealSense depth camera, a LIVOX MID-360 LiDAR processed via efficient spherical projection, and an onboard IMU. Quantitative results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations. Furthermore, the integration of a 3.2 s temporal context reduces mapping drift.
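The spherical projection mentioned above can be illustrated with a minimal NumPy sketch that bins LiDAR returns into a range image. The image size (64 × 512) and the vertical field of view (-7° to +52°, roughly matching the MID-360's published spec) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def spherical_projection(points, h=64, w=512,
                         fov_up=np.deg2rad(52.0), fov_down=np.deg2rad(-7.0)):
    """Project an (N, 3) point cloud onto an h x w range image.

    Assumed layout: rows span the vertical FOV, columns span 360 deg
    of azimuth; pixel value is the range of the nearest return, 0 = empty.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                 # range per return
    yaw = np.arctan2(y, x)                             # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))

    # normalise angles to integer pixel coordinates
    u = ((yaw + np.pi) / (2.0 * np.pi) * w).astype(np.int32) % w
    v = (fov_up - pitch) / (fov_up - fov_down) * h
    v = np.clip(v, 0, h - 1).astype(np.int32)

    # write far points first so the nearest return wins each pixel
    order = np.argsort(-r)
    img = np.zeros((h, w), dtype=np.float32)
    img[v[order], u[order]] = r[order]
    return img
```

The nearest-return rule matters for terrain mapping: when several points fall into one pixel, keeping the closest one preserves occluding edges that a heightmap decoder must respect.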
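The GRU core that provides temporal consistency can be sketched as a single recurrent cell carrying a hidden state across frames, so the decoder sees temporally smoothed features rather than per-frame noise. This is a generic GRU in NumPy (standard gate equations, random weights); the dimensions and the single-cell structure are assumptions for illustration, not the paper's trained model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Minimal GRU cell: z is the update gate, r the reset gate,
    n the candidate state; h' = (1 - z) * n + z * h."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_dim)
        # stacked weights for the z, r and n gates
        self.W = rng.uniform(-s, s, (3 * hidden_dim, input_dim))
        self.U = rng.uniform(-s, s, (3 * hidden_dim, hidden_dim))
        self.b = np.zeros(3 * hidden_dim)

    def step(self, x, h):
        Wz, Wr, Wn = np.split(self.W, 3)
        Uz, Ur, Un = np.split(self.U, 3)
        bz, br, bn = np.split(self.b, 3)
        z = sigmoid(Wz @ x + Uz @ h + bz)         # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)         # reset gate
        n = np.tanh(Wn @ x + Un @ (r * h) + bn)   # candidate state
        return (1.0 - z) * n + z * h              # new hidden state
```

A 3.2 s context at, say, 10 Hz would correspond to unrolling such a cell over 32 successive fused feature frames; the frame rate here is likewise an assumed figure.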