Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordances, define the goal, while coordinated balance, contact, and manipulation emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining followed by reinforcement-learning post-training. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably, owing to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations and then perform reinforcement-learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real-robot deployment.