Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, and Vision Transformers (ViTs) have recently emerged as a promising alternative that improves on state-of-the-art performance. However, the computational complexity of ViTs, which grows quadratically with the number of image patches, has limited their applicability to high-resolution images. In this paper, we propose three methods for reducing the computational complexity of ViTs, all based on selecting and processing a small number of the most informative patches while disregarding the others. The first two methods use a lightweight pose estimation network to guide patch selection, while the third uses a set of learnable joint tokens to ensure that the selected patches contain the most important information about body joints. Experiments across six benchmarks show that the proposed methods reduce computational complexity by 30% to 44%, with only a small drop in accuracy, between 0% and 3.5%.
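As a rough illustration of heatmap-guided patch selection, the sketch below scores each patch by the peak value of a joint heatmap and keeps only the top-scoring patches. It is a minimal sketch under stated assumptions, not the selection rule used in this paper: the peak-value scoring, the `keep_ratio` parameter, and the function names `patch_scores`, `select_patches`, and `attention_cost_ratio` are all hypothetical, and the heatmap is assumed to come from some lightweight pose network.

```python
from typing import List, Sequence


def patch_scores(heatmap: Sequence[Sequence[float]], patch: int) -> List[float]:
    """Score each non-overlapping patch by the peak heatmap value inside it.

    Assumption: `heatmap` is a single 2D joint-confidence map produced by a
    lightweight pose network, with sides divisible by `patch`.
    """
    height, width = len(heatmap), len(heatmap[0])
    if height % patch or width % patch:
        raise ValueError("heatmap size must be a multiple of the patch size")
    scores = []
    # Patches are visited in row-major order, matching the usual ViT token order.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            scores.append(max(
                heatmap[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ))
    return scores


def select_patches(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring patches, in ascending order."""
    keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def attention_cost_ratio(kept: int, total: int) -> float:
    """Relative cost of the pairwise attention term only, which scales as N^2."""
    return (kept / total) ** 2


# Toy 4x4 heatmap split into four 2x2 patches (hypothetical values).
heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.2, 0.0],
]
scores = patch_scores(heatmap, patch=2)          # [0.9, 0.0, 0.0, 0.2]
kept = select_patches(scores, keep_ratio=0.5)    # [0, 3]
ratio = attention_cost_ratio(len(kept), len(scores))  # 0.25
```

Only the selected patch indices would then be passed to the transformer. The ratio above covers only the pairwise attention term; the linear layers of a ViT scale linearly with the number of patches, so this figure is not comparable to the overall 30% to 44% reduction reported here.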