End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
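As a rough illustration of two ingredients named above, the sketch below shows one common way to encode a continuous phase clock and to track joint position targets with a low-level PD loop. The abstract does not specify the clock encoding, the gains, or the target velocity, so the sine/cosine encoding, the zero target velocity, and the function names `phase_clock` and `pd_torques` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def phase_clock(phase):
    """Encode a gait phase in [0, 1) as a continuous (sin, cos) pair.

    Assumption: a sin/cos encoding; the source only says "continuous phase clock".
    """
    angle = 2.0 * math.pi * phase
    return math.sin(angle), math.cos(angle)


def pd_torques(q_target, q, qd, kp, kd):
    """Per-joint torques from a PD law tracking position targets.

    Assumption: the target joint velocity is zero, so the damping term
    acts on the measured velocity alone. Gains are hypothetical.
    """
    return [
        kp_i * (qt_i - q_i) - kd_i * qd_i
        for qt_i, q_i, qd_i, kp_i, kd_i in zip(q_target, q, qd, kp, kd)
    ]


if __name__ == "__main__":
    # Hypothetical single-joint example: the policy asks for 1.0 rad,
    # the joint sits at 0.5 rad and moves at 0.2 rad/s.
    print(phase_clock(0.25))
    print(pd_torques([1.0], [0.5], [0.2], kp=[10.0], kd=[1.0]))
```

In a setup like this, the policy would receive the clock pair as part of its observation and emit `q_target` at each control step, while the PD loop converts those targets into torques at the simulator's rate.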