Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as separate tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, which limits recovery-oriented exploration in unstable or fallen states. To address these issues, we propose Stubborn, a streamlined and unified reinforcement learning framework for robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric actor-critic architecture and consists of three major components. First, a yaw-aligned tracking representation reduces sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, a Bernoulli-based probabilistic termination mechanism lets the policy explore fall-recovery behaviors under varying failure modes. Third, an adaptive sampling strategy, driven by probabilistic termination and tracking error, dynamically reshapes the sampling distribution based on tracking performance, improving training efficiency on difficult motion segments and unstable states. Extensive comparisons with state-of-the-art methods and ablation studies show that Stubborn achieves competitive performance, and that the proposed probabilistic termination mechanism and adaptive sampling strategy contribute to the performance and robustness gains. For real-world demonstrations, please refer to https://aislab-sustech.github.io/Stubborn/.
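As a rough illustration of the three components named above, the following sketch shows one way each idea could be expressed. It is not Stubborn's implementation: the abstract gives no formulas, so every name here (`yaw_align`, `should_terminate`, `sampling_weights`, `sample_segment`), every parameter (`p_term`, `alpha`, `beta`, `floor`), and the linear scoring rule are assumptions made purely for illustration.

```python
import math
import random


def yaw_align(dx, dy, yaw):
    """Rotate a world-frame planar offset (dx, dy) into the heading frame.

    Only the yaw angle is removed, so roll and pitch (the gravity-related
    part of the orientation) would be left untouched by this step.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


def should_terminate(failed, p_term, rng):
    """Bernoulli termination: on a tracking failure, end the episode only
    with probability p_term (hypothetical parameter), so that some
    episodes continue from the failed state.
    """
    return failed and rng.random() < p_term


def sampling_weights(term_rate, track_err, alpha=1.0, beta=1.0, floor=0.1):
    """Turn per-segment termination rates and tracking errors into a
    sampling distribution. The linear score and the floor term are
    assumptions; the floor keeps easy segments from never being sampled.
    """
    scores = [floor + alpha * t + beta * e for t, e in zip(term_rate, track_err)]
    total = sum(scores)
    return [s / total for s in scores]


def sample_segment(weights, rng):
    """Draw the index of the motion segment to start the next episode from."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Segment 2 fails and is terminated most often, so it gets the most weight.
    weights = sampling_weights([0.0, 0.1, 0.6], [0.02, 0.05, 0.30])
    print(weights)
    print(sample_segment(weights, rng))
    print(should_terminate(True, 0.3, rng))
```

In this sketch a segment that is terminated more often or tracked less accurately receives a larger share of the sampling distribution, while `p_term` below 1 leaves some failed episodes running so that recovery behavior can be explored.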