Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data. This high-quality data, however, exposes a second failure: conventional imitation-learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, a Phase Attention module (when) and a Multi-Scale Spatial module (where) are fused by a diffusion head to produce synchronized whole-body behaviors beyond mimicry. By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.