Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags: task-only rewards often converge to stiff, asymmetric gaits, while motion-imitation methods improve appearance but become more sensitive to external disturbances, because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude relative to task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.
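To make the idea of state-conditioned style targets concrete, the following is a minimal sketch of how such a reward term might be computed during training. It is not the authors' implementation: the function names, the toy predictor, the exponential reward shape, and the values of `sigma` and `style_weight` are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math
from typing import List, Sequence


def toy_predictor(history: Sequence[Sequence[float]],
                  command: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the offline predictor.

    Assumption: the real predictor is fitted offline; here two shoulder
    targets simply counter-swing the latest hip angles, scaled by the
    commanded forward speed.
    """
    left_hip, right_hip = history[-1][0], history[-1][1]
    gain = 0.5 * command[0]
    return [-gain * left_hip, -gain * right_hip]


def style_reward(upper_body_q: Sequence[float],
                 history: Sequence[Sequence[float]],
                 command: Sequence[float],
                 sigma: float = 0.25) -> float:
    """Exponential tracking reward (assumed form) on predicted targets.

    The targets depend only on the robot's current state history and
    command, not on a time index into a reference clip.
    """
    targets = toy_predictor(history, command)
    sq_err = sum((q - t) ** 2 for q, t in zip(upper_body_q, targets))
    return math.exp(-sq_err / sigma ** 2)


def total_reward(task_reward: float,
                 upper_body_q: Sequence[float],
                 history: Sequence[Sequence[float]],
                 command: Sequence[float],
                 style_weight: float = 0.5) -> float:
    """Task reward plus a weighted style term (the weight is an assumption)."""
    return task_reward + style_weight * style_reward(upper_body_q, history, command)


if __name__ == "__main__":
    history = [[0.2, -0.2]]      # latest lower-body sample: hip angles (rad)
    command = [1.0, 0.0, 0.0]    # forward, lateral, yaw velocity command
    print(total_reward(1.0, [-0.1, 0.1], history, command))
```

In this sketch the predictor is called only inside the reward computation, so a policy trained with it would not need the predictor as an input at deployment, which is consistent with the abstract's claim about the deployed controller's interface.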