Multi-person pose understanding from RGB videos involves three complex tasks: pose estimation, tracking, and motion forecasting. Intuitively, accurate multi-person pose estimation facilitates robust tracking, and robust tracking builds the history that correct motion forecasting depends on. Most existing works either focus on a single task or employ multi-stage approaches that solve the tasks separately; such approaches tend to make sub-optimal decisions at each stage and fail to exploit correlations among the three tasks. In this paper, we propose Snipper, a unified framework that performs multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage. We propose an efficient yet powerful deformable attention mechanism to aggregate spatiotemporal information from the video snippet. Building upon this deformable attention, a video transformer is learned to encode the spatiotemporal features from the multi-frame snippet and to decode informative pose features for multi-person pose queries. Finally, these pose queries are regressed to predict multi-person pose trajectories and future motions in a single shot. Experiments on three challenging public datasets show that our generic model rivals specialized state-of-the-art baselines for pose estimation, tracking, and forecasting.
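To make the idea of aggregating spatiotemporal information with deformable attention more concrete, the following is a minimal sketch in plain Python. It is not Snipper's implementation: it only illustrates the general deformable-attention pattern, in which a query samples a few points around a reference location and combines them with softmax-normalized weights, extended here across the frames of a snippet. The function names, the shared reference point, and the choice to pass offsets and attention logits in directly (in a trained model they would be predicted from the query) are all assumptions made for illustration.

```python
import math


def bilinear_sample(frame, x, y):
    """Bilinearly interpolate a feature vector at continuous (x, y).

    `frame` is a nested list of shape H x W x C. Coordinates are clamped
    to the frame border (an assumption of this sketch).
    """
    h, w = len(frame), len(frame[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    channels = len(frame[0][0])
    return [
        (1 - ay) * ((1 - ax) * frame[y0][x0][c] + ax * frame[y0][x1][c])
        + ay * ((1 - ax) * frame[y1][x0][c] + ax * frame[y1][x1][c])
        for c in range(channels)
    ]


def softmax(scores):
    """Numerically stable softmax over a flat list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(snippet, reference, offsets, scores):
    """Aggregate features for one query over a T-frame snippet.

    snippet:   T frames, each H x W x C (nested lists).
    reference: (x, y) reference point, shared across frames here.
    offsets:   T lists of K (dx, dy) sampling offsets (hypothetical inputs).
    scores:    T lists of K attention logits (hypothetical inputs).
    The softmax is taken jointly over all T*K sampled points.
    """
    weights = softmax([s for frame_scores in scores for s in frame_scores])
    channels = len(snippet[0][0][0])
    out = [0.0] * channels
    i = 0
    for t, frame in enumerate(snippet):
        for dx, dy in offsets[t]:
            feat = bilinear_sample(frame, reference[0] + dx, reference[1] + dy)
            for c in range(channels):
                out[c] += weights[i] * feat[c]
            i += 1
    return out


# Toy snippet: two 2x2 frames with a single feature channel.
frame_a = [[[0.0], [1.0]], [[2.0], [3.0]]]
frame_b = [[[10.0], [10.0]], [[10.0], [10.0]]]
result = deformable_attention(
    [frame_a, frame_b],
    reference=(0.5, 0.5),
    offsets=[[(0.0, 0.0)], [(0.0, 0.0)]],
    scores=[[0.0], [0.0]],
)
print(result)  # [5.75]
```

With equal logits, the two sampled points (1.5 from the first frame, 10 from the second) are averaged. The appeal of this pattern is that each query touches only a handful of sampled locations per frame instead of attending to every pixel of every frame, which is what makes attention over a multi-frame snippet affordable.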