End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), which limits both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos that eliminates these heuristic operations. A key challenge is associating individuals across frames when their temporal trajectories are complex and overlapping. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet.
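
To make the idea of a pose query selectively aggregating features across frames more concrete, the following is a minimal sketch of plain scaled dot-product cross-attention over tokens pooled from several frames. It is not the PAVE-Net pose-aware attention itself, whose details are not given here. The function names, the toy 2-D features, and the use of token features as both keys and values (with no learned projections) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(query, frame_tokens):
    """Aggregate token features from several frames for one pose query.

    query:        list of d floats (a hypothetical pose query embedding).
    frame_tokens: list of frames, each a list of d-dimensional token vectors.
    Tokens serve as both keys and values (an assumption of this sketch).
    Returns the aggregated feature and the attention weights over all tokens.
    """
    d = len(query)
    # Flatten tokens from all frames into a single sequence.
    tokens = [tok for frame in frame_tokens for tok in frame]
    # Scaled dot-product similarity between the query and every token.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens
    ]
    weights = softmax(scores)
    # Weighted sum of token features.
    aggregated = [
        sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)
    ]
    return aggregated, weights


# Toy data: two frames, two tokens per frame. In each frame the first token
# resembles "person A" and the second resembles "person B".
frames = [
    [[4.0, 0.0], [0.0, 4.0]],
    [[3.5, 0.5], [0.5, 3.5]],
]
query_a = [2.0, 0.0]  # hypothetical query aligned with person A

aggregated, weights = cross_frame_attention(query_a, frames)
```

In this toy setup the query for person A places most of its attention weight on the person-A token in each frame, which is the selective aggregation behavior the abstract describes at a high level.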
