Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
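The teacher-student curriculum described above can be sketched in miniature. Everything below is a hypothetical toy illustration, not the paper's actual implementation: the linear policies, the stub "teachers", and the toy reward are all assumptions made purely to show the three-stage shape (per-subject teachers → distillation → RL fine-tuning).

```python
# Toy sketch of the curriculum: subject-specific teachers are distilled into
# one student policy, which is then RL fine-tuned past pure imitation.
# All names, the linear-policy form, and the reward are illustrative assumptions.
import random

random.seed(0)

def make_teacher(subject_bias):
    # Stand-in for a teacher policy trained on one subject's MoCap clips;
    # each teacher shares a common tracking gain plus a subject-specific quirk.
    return lambda state: 0.5 * state + subject_bias

teachers = {"s1": make_teacher(0.1), "s2": make_teacher(-0.1)}

# Stage 2: distill the teachers into a single linear student policy
# a = w * state + b by regressing onto teacher actions (online supervision
# reduced here to plain behavior cloning with SGD).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    teacher = teachers[random.choice(list(teachers))]
    state = random.uniform(-1.0, 1.0)
    err = (w * state + b) - teacher(state)
    w -= lr * err * state
    b -= lr * err

# Stage 3: RL fine-tuning nudges the student beyond demonstration replication;
# the toy reward -0.5 * (a - 0.5 * state)**2 favors the shared structure over
# the teachers' subject-specific biases.
for _ in range(200):
    state = random.uniform(-1.0, 1.0)
    action = w * state + b
    grad = -(action - 0.5 * state)  # d(reward)/d(action)
    w += 0.05 * grad * state
    b += 0.05 * grad
```

After distillation the student averages out the per-subject biases, and the fine-tuning stage pulls it toward the higher-reward shared solution, mirroring (in caricature) how RL fine-tuning lets the student surpass mere replication of its teachers.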