Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high update-to-data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving the exploratory coverage needed for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning. Code and videos: https://lift-humanoid.github.io
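The pretraining recipe described above can be sketched as a toy loop: for every environment step collected, the agent performs UTD-many gradient updates on large batches sampled from the replay buffer. This is a minimal illustration only; the function names (`collect_step`, `sac_update`) and default values are assumptions for exposition, not the paper's actual implementation.

```python
import random

def collect_step(buffer):
    # Stand-in for executing the policy in simulation and
    # storing the resulting transition in the replay buffer.
    buffer.append(random.random())

def sac_update(buffer, batch_size):
    # Stand-in for one SAC gradient step on a large sampled batch.
    batch = random.choices(buffer, k=min(batch_size, len(buffer)))
    return len(batch)

def train(num_env_steps=100, utd_ratio=4, batch_size=256):
    """Run a high-UTD off-policy loop and count gradient updates."""
    buffer, num_updates = [], 0
    for _ in range(num_env_steps):
        collect_step(buffer)            # one environment step ...
        for _ in range(utd_ratio):      # ... then utd_ratio gradient steps
            sac_update(buffer, batch_size)
            num_updates += 1
    return num_updates
```

With a UTD ratio of 4, 100 environment steps yield 400 gradient updates, which is the sense in which a high UTD ratio trades extra compute per sample for better sample efficiency.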