Vision-Proprioception Fusion with Mamba2 in End-to-End Reinforcement Learning for Motion Control

End-to-end reinforcement learning (RL) for motion control trains policies directly from sensor inputs to motor commands, enabling unified controllers for different robots and tasks. However, most existing methods are either blind (proprioception-only) or rely on fusion backbones with unfavorable compute-memory trade-offs. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length, limiting temporal and spatial context. We present a vision-driven cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to enable both recurrent and convolutional scanning with hardware-aware streaming and near-linear scaling. Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers. The selective state-space updates retain long-range dependencies with markedly lower latency and memory use than quadratic self-attention, enabling longer look-ahead, higher token resolution, and stable training under limited compute. Policies are trained end-to-end under curricula that randomize terrain and appearance and progressively increase scene complexity. A compact, state-centric reward balances task progress, energy efficiency, and safety. Across diverse motion-control scenarios, our approach consistently surpasses strong state-of-the-art baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget. These results suggest that SSD-Mamba2 provides a practical fusion backbone for resource-constrained robotic and autonomous systems in engineering informatics applications.

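The selective state-space update and the state-space duality (SSD) mentioned in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes a single input channel, a scalar per-step decay `a[t]`, and input-dependent projections `B[t]` and `C[t]`, following the way SSD is commonly presented for Mamba2. The function names, shapes, and random inputs are hypothetical. The sketch compares a linear-time recurrent scan with the equivalent quadratic, attention-like masked matrix form, which is the equivalence that lets an SSD layer stream over long token sequences instead of materializing a full token-by-token matrix.

```python
import numpy as np

def ssd_recurrent(x, a, B, C):
    """Linear-time scan: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)          # hidden state of size N, carried across tokens
    y = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_quadratic(x, a, B, C):
    """Dual form: y = M x, with M[t, s] = (C_t . B_s) * a_{s+1} * ... * a_t for s <= t."""
    log_a = np.cumsum(np.log(a))
    # decay[t, s] = exp(log_a[t] - log_a[s]) = product of a_k for k in s+1..t
    decay = np.exp(log_a[:, None] - log_a[None, :])
    # Lower-triangular mask enforces causality (token t only sees s <= t).
    M = np.tril((C @ B.T) * decay)
    return M @ x

# Hypothetical toy inputs: T fused tokens, state size N.
rng = np.random.default_rng(0)
T, N = 16, 4
x = rng.normal(size=T)
a = rng.uniform(0.5, 1.0, size=T)   # decays in (0, 1], so the logs are defined
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))

y_rec = ssd_recurrent(x, a, B, C)
y_quad = ssd_quadratic(x, a, B, C)
print(np.allclose(y_rec, y_quad))
```

The recurrent form costs O(T·N) time and keeps only an N-dimensional state, while the matrix form costs O(T²). A practical SSD kernel is usually described as mixing the two (matrix form inside short chunks, recurrence across chunks), which this sketch does not attempt.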