Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, and Vision Transformers (ViTs) have recently emerged as a promising alternative that further improves state-of-the-art performance. However, the computational complexity of ViTs, which is quadratic in the number of input patches, limits their applicability to high-resolution images and long videos. To address this challenge, we propose a simple method that reduces ViT's computational cost by selecting and processing only a small number of the most informative patches and disregarding the rest. A lightweight pose estimation network guides the patch selection process, so that the selected patches contain the most important information. Experimental results on three widely used 2D pose estimation benchmarks, namely COCO, MPII, and OCHuman, show that our approach significantly improves speed and reduces computational cost with only a slight drop in performance.
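To make the patch selection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the lightweight pose network outputs per-joint heatmaps at the input resolution, and it scores each patch by summing the per-pixel maximum over joints; that scoring rule, the function names `select_patches` and `attention_cost_ratio`, and the `keep_ratio` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def select_patches(heatmaps, patch_size, keep_ratio):
    """Pick the highest-scoring patches from joint heatmaps.

    heatmaps:   array of shape (num_joints, H, W), assumed to come from a
                lightweight pose network (hypothetical input format).
    patch_size: side length of a square ViT patch, in pixels.
    keep_ratio: fraction of patches to keep, in (0, 1].
    Returns the flat (row-major) indices of the kept patches, sorted.
    """
    num_joints, height, width = heatmaps.shape
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size

    # Assumed scoring rule: per-pixel maximum over joints ...
    saliency = heatmaps.max(axis=0)
    # ... summed inside each non-overlapping patch.
    patch_scores = (
        saliency.reshape(grid_h, patch_size, grid_w, patch_size)
        .sum(axis=(1, 3))
        .ravel()
    )

    num_keep = max(1, int(round(keep_ratio * patch_scores.size)))
    # Stable sort so ties are broken by patch index.
    order = np.argsort(-patch_scores, kind="stable")
    return np.sort(order[:num_keep])


def attention_cost_ratio(num_kept, num_total):
    """Relative self-attention cost when only num_kept of num_total
    patches are processed, using the quadratic dependence on token count."""
    return (num_kept / num_total) ** 2
```

Under these assumptions, only the tokens at the returned indices would be passed to the ViT. Because self-attention cost grows quadratically with the number of tokens, keeping a quarter of the patches reduces the attention term to roughly one sixteenth of its original cost, which is what `attention_cost_ratio` computes; other parts of the network scale differently, so the overall speedup would be smaller.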