Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by supplying classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and shaping rewards directly around these quantities during training, while the actor relies solely on proprioception to enable zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling followed by multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
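To make the capture point concrete, the sketch below computes it under the standard linear inverted pendulum assumption, where the capture point is the center-of-mass ground projection offset by the center-of-mass velocity divided by the natural frequency sqrt(g / z). The function names, the exponential reward form, and the `sharpness` parameter are hypothetical illustrations; the abstract does not specify how the shaping rewards are defined, so this should not be read as the authors' implementation.

```python
import math


def capture_point(com_xy, com_vel_xy, com_height, gravity=9.81):
    """Instantaneous capture point under the linear inverted pendulum model.

    com_xy:      (x, y) ground projection of the center of mass, in meters
    com_vel_xy:  (vx, vy) horizontal center-of-mass velocity, in m/s
    com_height:  center-of-mass height above the ground, in meters
    """
    if com_height <= 0.0:
        raise ValueError("com_height must be positive")
    omega = math.sqrt(gravity / com_height)  # natural frequency of the pendulum
    return tuple(p + v / omega for p, v in zip(com_xy, com_vel_xy))


def capture_point_reward(cp_xy, support_center_xy, sharpness=5.0):
    """Hypothetical shaping term: 1.0 when the capture point sits on the
    support center, decaying exponentially with horizontal distance.
    This form is an assumption chosen for illustration only."""
    dist = math.hypot(cp_xy[0] - support_center_xy[0],
                      cp_xy[1] - support_center_xy[1])
    return math.exp(-sharpness * dist)


if __name__ == "__main__":
    # A robot with its center of mass 0.9 m high, pushed forward at 0.6 m/s.
    cp = capture_point((0.0, 0.0), (0.6, 0.0), 0.9)
    print("capture point:", cp)
    print("shaping reward:", capture_point_reward(cp, (0.0, 0.0)))
```

In a privileged-critic setup of the kind the abstract describes, quantities such as `cp` would be available to the critic and the reward during simulation only, while the actor would see proprioceptive observations alone.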