Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite steady gains in performance, existing offline methods, which generate fixed-length motions for a fixed number of agents, are inherently limited in handling long or variable-length text and varying agent counts. These limitations naturally motivate autoregressive formulations, which predict future motion step by step conditioned on all past trajectories and the current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design allows direct adaptation to varying numbers of participants without additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, aggregating local within-window and global cross-window conditions to capture each person's motion history and inter-person dependencies while aligning with the text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, a significant improvement over the previous state-of-the-art score of 5.154.
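The sliding-window strategy described above can be sketched as a simple autoregressive loop. This is a minimal, hypothetical illustration only: the window size, feature dimension, condition aggregation, and the `denoise_window` stand-in are assumptions for exposition, not HINT's actual architecture.

```python
import numpy as np

# Illustrative constants (assumed, not from the paper).
WINDOW = 16   # frames generated per step
DIM = 8       # per-frame motion feature dimension

def denoise_window(local_cond: np.ndarray, global_cond: np.ndarray,
                   num_agents: int) -> np.ndarray:
    """Stand-in for the diffusion denoiser: produces one window of motion
    for every agent, conditioned on local and global context."""
    rng = np.random.default_rng(0)
    ctx = local_cond.mean() + global_cond.mean()
    return rng.normal(loc=ctx, scale=0.1, size=(num_agents, WINDOW, DIM))

def generate(text_embedding: np.ndarray, num_agents: int,
             num_windows: int) -> np.ndarray:
    """Autoregressively generate motion one window at a time."""
    history = np.zeros((num_agents, WINDOW, DIM))  # seed context
    outputs = []
    for _ in range(num_windows):
        # Local condition: the most recent window of every agent's motion,
        # supporting fine-grained inter-person interaction modeling.
        local_cond = history[:, -WINDOW:, :]
        # Global condition: a summary of all past windows plus the text,
        # preserving long-horizon coherence across the sequence.
        past = np.concatenate(outputs, axis=1) if outputs else history
        global_cond = np.concatenate([past.mean(axis=(0, 1)), text_embedding])
        window = denoise_window(local_cond, global_cond, num_agents)
        outputs.append(window)
        history = window
    return np.concatenate(outputs, axis=1)

motion = generate(text_embedding=np.zeros(4), num_agents=3, num_windows=5)
print(motion.shape)  # (3, 80, 8): 3 agents, 5 windows of 16 frames each
```

Note how the agent count is just a leading array dimension here, mirroring the abstract's claim that a disentangled per-person representation adapts to varying numbers of participants without architectural changes.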