Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation

This paper presents ED-Pose, a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, which unifies contextual learning between human-level (global) and keypoint-level (local) information. Unlike previous one-stage methods, ED-Pose reformulates the task as two explicit box detection processes that share a unified representation and regression supervision. First, we introduce a human detection decoder that extracts global features from the encoded tokens. It provides a good initialization for the subsequent keypoint detection, which speeds up training convergence. Second, to bring in contextual information near keypoints, we treat pose estimation as a keypoint box detection problem and learn both the box position and the box content for each keypoint. A human-to-keypoint detection decoder adopts an interactive learning strategy between human and keypoint features to further enhance global and local feature aggregation. Overall, ED-Pose is conceptually simple and requires neither post-processing nor dense heatmap supervision. It is both effective and efficient compared with two-stage and one-stage methods. Notably, explicit box detection improves pose estimation performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, a fully end-to-end framework trained with an L1 regression loss surpasses heatmap-based top-down methods under the same backbone, by 1.2 AP on COCO, and ED-Pose achieves state-of-the-art performance of 76.6 AP on CrowdPose without bells and whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.
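To make the idea of keypoint box detection with L1 regression concrete, the following is a minimal sketch and not the ED-Pose implementation. It assumes each keypoint is wrapped in a small square box in normalized (cx, cy, w, h) form and that the loss is a plain mean absolute error over box coordinates. The helper names (`keypoint_to_box`, `box_to_keypoint`, `l1_box_loss`) and the fixed box size are hypothetical; the abstract does not specify how box sizes are chosen or how loss terms are weighted.

```python
from typing import List, Tuple

# Assumed representation: (cx, cy, w, h), all normalized to [0, 1].
Box = Tuple[float, float, float, float]


def keypoint_to_box(x: float, y: float, size: float = 0.05) -> Box:
    """Wrap a keypoint in a square box centered on it.

    `size` is a hypothetical hyperparameter chosen for illustration only.
    """
    return (x, y, size, size)


def box_to_keypoint(box: Box) -> Tuple[float, float]:
    """Recover the keypoint location as the center of its box."""
    return (box[0], box[1])


def l1_box_loss(pred: List[Box], target: List[Box]) -> float:
    """Mean absolute error over all coordinates of all boxes."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of boxes")
    if not pred:
        return 0.0
    total = sum(
        abs(p - t)
        for pred_box, target_box in zip(pred, target)
        for p, t in zip(pred_box, target_box)
    )
    return total / (4 * len(pred))


if __name__ == "__main__":
    # Two ground-truth keypoints turned into box targets.
    targets = [keypoint_to_box(0.5, 0.5), keypoint_to_box(0.25, 0.75)]
    # A prediction whose second box center is off by 0.25 along x.
    preds = [(0.5, 0.5, 0.05, 0.05), (0.5, 0.75, 0.05, 0.05)]
    print(l1_box_loss(preds, targets))
    print(box_to_keypoint(preds[1]))
```

Because the target is a box rather than a dense heatmap, the final keypoint can be read directly from the box center, which is consistent with the abstract's claim that no post-processing is needed.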
