Embedding Classical Balance Control Principles in Reinforcement Learning for Humanoid Recovery

Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by embedding classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and by shaping rewards directly around these quantities during training, while the actor relies solely on proprioception for zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling followed by multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
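
As a rough illustration of the kind of balance quantity referred to above, the sketch below computes the instantaneous capture point under the standard linear inverted pendulum model, xi = x + v / omega with omega = sqrt(g / z), and turns it into a simple shaping term. This is a minimal sketch under stated assumptions, not the formulation used in this work: the function names, the Gaussian reward shape, the `support_center` argument, and the `sigma` value are all hypothetical.

```python
import numpy as np

GRAVITY = 9.81  # m/s^2


def capture_point(com_pos, com_vel, com_height):
    """Instantaneous capture point of the linear inverted pendulum model.

    com_pos, com_vel: horizontal (x, y) center-of-mass position and velocity.
    com_height: center-of-mass height above the support surface, in meters.
    """
    omega = np.sqrt(GRAVITY / com_height)  # natural frequency of the pendulum
    com_pos = np.asarray(com_pos, dtype=float)
    com_vel = np.asarray(com_vel, dtype=float)
    return com_pos + com_vel / omega


def capture_point_reward(cp, support_center, sigma=0.1):
    """Hypothetical shaping term (an assumption, not taken from the paper).

    Returns 1.0 when the capture point coincides with the support center
    and decays smoothly as the capture point moves away from it.
    """
    dist = np.linalg.norm(np.asarray(cp, dtype=float)
                          - np.asarray(support_center, dtype=float))
    return float(np.exp(-(dist / sigma) ** 2))


if __name__ == "__main__":
    # A forward push of 0.5 m/s on a robot whose CoM is 0.9 m high.
    cp = capture_point([0.0, 0.0], [0.5, 0.0], com_height=0.9)
    print("capture point:", cp)
    print("shaping reward:", capture_point_reward(cp, [0.0, 0.0]))
```

In an asymmetric actor-critic setup like the one described, a quantity of this kind would be computed from simulator state and passed only to the critic and the reward, so the actor's observation stays purely proprioceptive.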
