High-resolution representations are essential for good performance in human pose estimation models. To obtain such features, existing works use high-resolution input images or fine-grained image tokens. However, these dense high-resolution representations impose a significant computational burden. In this paper, we address the following question: "Given that human pose estimation only detects sparse human keypoint locations, is it really necessary to describe the whole image in a dense, high-resolution manner?" Building on dynamic transformer models, we propose a framework that uses only Sparse High-resolution Representations for human Pose estimation (SHaRPose). SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. A quality predictor then decides whether the coarse estimation should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and produces refined, high-precision pose estimations. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and runs inference $1.4\times$ faster than ViTPose-Base.
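The two-stage control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`select_regions`, `coarse_to_fine`), the form of the relevance scores, the threshold direction (a higher quality score means the coarse result is accepted), and the parameters `quality_threshold` and `keep_ratio` are all assumptions introduced here.

```python
def select_regions(relevance, keep_ratio):
    """Pick the coarse image regions most related to any keypoint.

    relevance: list of rows, one per keypoint, each holding one score per
    image region (assumed to come from the coarse stage, e.g. attention
    between keypoints and regions).
    keep_ratio: fraction of regions to keep for the fine stage (hypothetical).
    """
    num_regions = len(relevance[0])
    # A region matters if at least one keypoint is strongly related to it.
    per_region = [max(row[j] for row in relevance) for j in range(num_regions)]
    k = max(1, round(keep_ratio * num_regions))
    ranked = sorted(range(num_regions), key=lambda j: per_region[j], reverse=True)
    return sorted(ranked[:k])


def coarse_to_fine(coarse_pose, relevance, quality, refine_fn,
                   quality_threshold=0.5, keep_ratio=0.25):
    """Return (pose, kept_regions); kept_regions is None if no refinement ran.

    quality: score from a quality predictor (assumed: higher is better).
    refine_fn: stand-in for the fine stage; it receives the coarse pose and
    the indices of the regions to represent at high resolution.
    """
    if quality >= quality_threshold:
        # Coarse estimate is judged good enough, so skip the fine stage.
        return coarse_pose, None
    kept = select_regions(relevance, keep_ratio)
    return refine_fn(coarse_pose, kept), kept
```

In this sketch the saving comes from two places: easy samples exit after the coarse stage, and hard samples are refined using only the selected subset of regions rather than the whole image.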