Neuromorphic Reinforcement Learning for Quadruped Locomotion Control on Uneven Terrain

Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3\(\times\) compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
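For orientation, the conventional PPO clipped surrogate that ratio clipping refers to can be sketched as below. This is the standard PPO objective only: the abstract does not specify the two-sided clipping mechanism or the EP output-nudging signal, so neither is reproduced here. The function name `ppo_clipped_surrogate` and the default `epsilon=0.2` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over samples.

    Hypothetical helper for illustration; epsilon=0.2 is an assumed default.
    """
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - epsilon, 1 + epsilon] bounds the size of the update.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The element-wise minimum gives a pessimistic bound on the improvement.
    return np.mean(np.minimum(unclipped, clipped))
```

When the two policies coincide the ratio is 1 and the surrogate reduces to the mean advantage; once the ratio leaves the clipping interval in the direction favored by the advantage, the objective stops rewarding further movement.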


Related content

Backpropagation


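Strictly speaking, the term backpropagation refers only to the algorithm for computing the gradient, not to how the gradient is used. The term is nevertheless often used loosely for the entire learning algorithm, including how the gradient is used, for example by stochastic gradient descent. Backpropagation generalizes the gradient computation of the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, of which backpropagation is the special case known as reverse accumulation (or "reverse mode"). In machine learning, backpropagation (backprop) is a widely used algorithm for training feedforward neural networks in supervised learning. Generalizations of backpropagation exist for other artificial neural networks (ANNs); this class of algorithms is also commonly referred to as "backpropagation". The algorithm works by using the chain rule to compute the gradient of the loss function with respect to each weight, one layer at a time, iterating backward from the last layer so that intermediate terms of the chain rule are not recomputed.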
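The layer-by-layer chain-rule computation can be illustrated with a minimal NumPy sketch. The two-layer network (tanh hidden layer, linear output, squared-error loss) and the names `forward`, `loss`, and `backprop` are assumptions chosen for illustration; they are not taken from the paper above.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network: tanh hidden layer followed by a linear output."""
    h = np.tanh(W1 @ x)
    y = W2 @ h
    return h, y

def loss(x, t, W1, W2):
    """Squared-error loss for a single input-target pair."""
    _, y = forward(x, W1, W2)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(x, t, W1, W2):
    """Gradients of the loss w.r.t. W1 and W2, computed from the last layer backward."""
    h, y = forward(x, W1, W2)
    # Output layer: dL/dy for the squared-error loss.
    delta2 = y - t
    grad_W2 = np.outer(delta2, h)
    # Hidden layer: reuse delta2 rather than recomputing the output-layer terms.
    # The derivative of tanh is 1 - tanh^2, and h already stores tanh(W1 @ x).
    delta1 = (W2.T @ delta2) * (1.0 - h ** 2)
    grad_W1 = np.outer(delta1, x)
    return grad_W1, grad_W2
```

Reusing `delta2` when forming `delta1` is the point of iterating backward: each layer's gradient needs only the error signal handed down from the layer after it, so no chain-rule factor is evaluated twice.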
