CROSS-GAiT: Cross-Attention-Based Multimodal Representation Fusion for Parametric Gait Adaptation in Complex Terrains

We present CROSS-GAiT, a novel algorithm for quadruped robots that uses cross-attention to fuse terrain representations derived from visual and time-series inputs, including linear accelerations, angular velocities, and joint efforts. The fused representations are used to adjust the robot's step height and hip splay, enabling adaptive gaits that respond dynamically to varying and unpredictable terrain conditions. We generate the terrain representations by processing visual inputs through a masked Vision Transformer (ViT) encoder and time-series data through a dilated causal convolutional encoder. The cross-attention mechanism then selects and integrates the most relevant features from each modality, combining terrain characteristics with robot dynamics for better-informed gait adjustments. We train CROSS-GAiT on data from diverse terrains, including asphalt, concrete, brick pavements, grass, dense vegetation, pebbles, gravel, and sand. Our algorithm generalizes well and adapts to unseen environmental conditions, enhancing real-time navigation performance. We implemented CROSS-GAiT on a Ghost Robotics Vision 60 robot and tested it extensively in complex terrains with high vegetation density, uneven or unstable surfaces, sand banks, and deformable substrates. Compared to state-of-the-art methods, we observe at least a 7.04% reduction in IMU energy density and a 27.3% reduction in total joint effort, which directly correlate with increased stability and reduced energy usage. Furthermore, CROSS-GAiT demonstrates at least a 64.5% increase in success rate and a 4.91% reduction in time to reach the goal in four complex scenarios. Additionally, the learned representations perform 4.48% better than the state-of-the-art on a terrain classification task.
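
As an illustration of the two building blocks named above, the following is a minimal NumPy sketch of a dilated causal 1D convolution and a single-head cross-attention step. It is not the CROSS-GAiT implementation: the function names, tensor shapes, random weights, and the choice of time-series tokens as queries over visual tokens are all assumptions made only for this example.

```python
import numpy as np


def dilated_causal_conv1d(x, w, dilation):
    """Dilated causal convolution over a time series.

    x: (T, C_in) input sequence, w: (K, C_in, C_out) kernel.
    The output at time t depends only on x[t], x[t - d], x[t - 2d], ...
    """
    T, c_in = x.shape
    K, _, c_out = w.shape
    pad = (K - 1) * dilation
    # Left-pad with zeros so that no output sees future samples.
    x_padded = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    y = np.zeros((T, c_out))
    for k in range(K):
        # Tap k looks back (K - 1 - k) * dilation steps from time t.
        start = k * dilation
        y += x_padded[start:start + T] @ w[k]
    return y


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (N_q, D_q) tokens from one modality.
    context: (N_c, D_c) tokens from the other modality.
    Returns the fused tokens (N_q, D) and the attention weights (N_q, N_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: 50 time steps of 9 proprioceptive channels,
# 16 visual patch tokens of width 32, shared attention width 8.
rng = np.random.default_rng(0)
imu = rng.normal(size=(50, 9))
visual_tokens = rng.normal(size=(16, 32))

# Assumption: the time-series features act as queries over visual tokens.
ts_tokens = dilated_causal_conv1d(imu, rng.normal(size=(3, 9, 8)), dilation=2)
fused, attn = cross_attention(
    ts_tokens,
    visual_tokens,
    w_q=rng.normal(size=(8, 8)),
    w_k=rng.normal(size=(32, 8)),
    w_v=rng.normal(size=(32, 8)),
)
print(fused.shape, attn.shape)
```

In this sketch each time step attends over all visual patches, so the fused tokens keep the temporal length of the time-series branch. A downstream head mapping such tokens to step height and hip splay is omitted here, since the abstract does not specify its form.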
