In general, human pose estimation methods fall into two categories according to their architectures: regression-based (i.e., heatmap-free) and heatmap-based methods. The former directly estimates the precise coordinates of each keypoint using convolutional and fully-connected layers. Although this approach can detect overlapping and densely placed keypoints, it can produce unexpected results for keypoints that do not exist in the scene. The latter, in contrast, can filter out non-existent keypoints by using a predicted heatmap for each keypoint. Nevertheless, it suffers from quantization error when keypoint coordinates are extracted from the heatmaps. In addition, unlike the regression-based approach, it has difficulty distinguishing densely placed keypoints in an image. To this end, we propose a hybrid model for single-stage multi-person pose estimation, named HybridPose, which overcomes the drawbacks of each approach by leveraging the strengths of both. Furthermore, we introduce a self-correlation loss to inject spatial dependencies between keypoint coordinates and their visibility. As a result, HybridPose is capable of not only detecting densely placed keypoints but also filtering out non-existent keypoints in an image. Experimental results demonstrate that the proposed HybridPose provides keypoint visibility without degrading pose estimation accuracy.
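The quantization error attributed to heatmap-based decoding can be illustrated with a small sketch. This is not the HybridPose implementation: the function names (`render_heatmap`, `decode_argmax`), the stride of 4, the Gaussian target, and the example coordinates are all assumptions chosen for illustration. The sketch renders a heatmap on a downsampled grid, recovers the keypoint with a plain argmax, and measures how far the decoded position lands from the true one.

```python
import math

def render_heatmap(x, y, width, height, stride=4, sigma=1.0):
    """Render a Gaussian heatmap on a grid downsampled by `stride` (assumed setup)."""
    w, h = width // stride, height // stride
    cx, cy = x / stride, y / stride  # keypoint position in heatmap coordinates
    return [
        [math.exp(-((i - cx) ** 2 + (j - cy) ** 2) / (2 * sigma ** 2)) for i in range(w)]
        for j in range(h)
    ]

def decode_argmax(heatmap, stride=4):
    """Map the highest-scoring heatmap cell back to image coordinates."""
    _, i, j = max(
        (value, i, j)
        for j, row in enumerate(heatmap)
        for i, value in enumerate(row)
    )
    return i * stride, j * stride

# Hypothetical ground-truth keypoint in a 64x64 image.
true_xy = (37.0, 23.0)
heatmap = render_heatmap(*true_xy, width=64, height=64)
decoded_xy = decode_argmax(heatmap)

# The decoded position snaps to a multiple of the stride, so it misses the true one.
error = math.hypot(true_xy[0] - decoded_xy[0], true_xy[1] - decoded_xy[1])
print(decoded_xy, round(error, 3))
```

Under these assumptions the decoded coordinates can only be multiples of the stride, which is the source of the error that a direct coordinate regression avoids.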