In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches are hard to deploy in industrial settings because of their large parameter counts and computational overhead. Efficient human pose estimation remains challenging, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation rely primarily on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully exploits the advantages of the Transformer. GTPT alleviates the computational burden by introducing keypoints gradually in a coarse-to-fine manner, keeping computational overhead low while maintaining high performance. In addition, GTPT groups keypoint tokens and prunes visual tokens, which improves model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conduct experiments on COCO and COCO-WholeBody. The results show that GTPT achieves higher performance with less computation than other methods, especially on the whole-body task with its numerous keypoints.
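To make the idea of pruning visual tokens concrete, the following is a minimal sketch and not the GTPT implementation. The abstract does not state the pruning criterion, so this example assumes one: each visual token is scored by the mean attention it receives from the keypoint tokens, and only the highest-scoring fraction is kept. The function name `prune_visual_tokens` and the parameter `keep_ratio` are hypothetical.

```python
def prune_visual_tokens(attn, keep_ratio):
    """Return the indices of the visual tokens to keep.

    attn[k][v] is the attention weight from keypoint token k to visual
    token v. Scoring by mean received attention is an assumption made
    for illustration; the abstract does not specify the criterion.
    """
    num_keypoints = len(attn)
    num_visual = len(attn[0])

    # Score each visual token by the mean attention it receives
    # across all keypoint tokens.
    scores = [
        sum(row[v] for row in attn) / num_keypoints
        for v in range(num_visual)
    ]

    # Keep at least one token so later layers always have input.
    num_keep = max(1, int(num_visual * keep_ratio))

    # Rank by descending score, breaking ties by lower index.
    ranked = sorted(range(num_visual), key=lambda v: (-scores[v], v))

    # Return kept indices in their original spatial order.
    return sorted(ranked[:num_keep])


# Two keypoint tokens attending to four visual tokens.
attn = [
    [0.05, 0.40, 0.05, 0.50],
    [0.10, 0.30, 0.20, 0.40],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # [1, 3]
```

Because attention cost grows with the number of tokens, dropping half of the visual tokens in this way shrinks the input to every later layer, which is the kind of saving the abstract attributes to pruning.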