Multi-person motion prediction is a challenging task, especially in real-world scenarios where people interact closely. Most previous works study weak interactions (e.g., hand-shaking) and typically forecast each person's pose in isolation. In this paper, we focus on motion prediction for multiple persons engaged in extreme collaboration and explore the relationships between the motion trajectories of highly interactive persons. Specifically, we propose a novel cross-query attention (XQA) module, tailored to this setting, that bilaterally learns the cross-dependencies between the two pose sequences. Additionally, we introduce a proxy entity that bridges the involved persons; it cooperates with the proposed XQA module and subtly controls the bidirectional information flow, acting as a motion intermediary. We then adapt these designs to a Transformer-based architecture and devise a simple yet effective end-to-end framework, the proxy-bridged game Transformer (PGformer), for multi-person interactive motion prediction. We evaluate our method on the challenging ExPI dataset, which involves highly interactive actions. PGformer consistently outperforms state-of-the-art methods by a large margin in both short- and long-term prediction. Our approach is also compatible with the weakly interactive CMU-Mocap and MuPoTS-3D datasets, where it achieves encouraging results. Our code will be made publicly available upon acceptance.
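To make the idea of bilaterally learning cross-dependencies between two pose sequences more concrete, the following is a minimal sketch of generic bidirectional cross-attention, in which each person's sequence queries the other's. It is not the XQA module itself: the abstract does not specify its internals, so the sketch omits learned projections and the proxy entity, and all names (`cross_attention`, `bilateral_cross_attention`) and the sequence length and feature size are hypothetical choices for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention: rows of `queries` attend over rows of `context`.

    Assumption: no learned query/key/value projections are used here;
    a real model would apply trainable linear maps first.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # shape (T_q, T_c)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights

def bilateral_cross_attention(seq_a, seq_b):
    # Person A queries person B, and person B queries person A.
    a_from_b, w_ab = cross_attention(seq_a, seq_b)
    b_from_a, w_ba = cross_attention(seq_b, seq_a)
    return a_from_b, b_from_a, w_ab, w_ba

# Hypothetical sizes: 50 frames, 54 features per frame (e.g., 18 joints x 3D).
rng = np.random.default_rng(0)
T, D = 50, 54
pose_a = rng.standard_normal((T, D))
pose_b = rng.standard_normal((T, D))

a_ctx, b_ctx, w_ab, w_ba = bilateral_cross_attention(pose_a, pose_b)
print(a_ctx.shape, b_ctx.shape, w_ab.shape, w_ba.shape)
```

In this sketch, `a_ctx` summarizes person B's motion from person A's point of view and `b_ctx` does the reverse; a proxy entity as described above would presumably sit between the two directions, but how it does so is not stated in the abstract.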