Human pose estimation at medium and small scales has long been a significant challenge in the field. Most existing methods restore high-resolution feature maps either by stacking multiple costly deconvolutional layers or by continuously aggregating semantic information from low-resolution feature maps while maintaining high-resolution ones, which can lead to information redundancy. Additionally, due to quantization errors, heatmap-based methods are at a disadvantage when precisely locating the keypoints of medium- and small-scale human figures. In this paper, we propose HRPVT, which uses PVT v2 as the backbone to model long-range dependencies. Building on this, we introduce the High-Resolution Pyramid Module (HRPM), designed to generate higher-quality high-resolution representations by incorporating the intrinsic inductive biases of Convolutional Neural Networks (CNNs) into the high-resolution feature maps. Integrating HRPM improves the performance of pure transformer-based models on human pose estimation at medium and small scales. Furthermore, we replace the heatmap-based method with the SimCC approach, which eliminates the need for costly upsampling layers and thereby allows us to allocate more computational resources to HRPM. To accommodate models of varying parameter scales, we develop two insertion strategies for HRPM, each designed to enhance the model's ability to perceive medium- and small-scale human poses from a distinct perspective.
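As context for the SimCC replacement mentioned above, the sketch below illustrates the general idea of coordinate classification: each keypoint's x and y coordinates are predicted as separate 1-D classifications over sub-pixel bins, so decoding is a cheap per-axis argmax with no heatmap upsampling and reduced quantization error. This is a minimal NumPy sketch under assumed shapes and a hypothetical `simcc_decode` helper, not the paper's actual implementation.

```python
import numpy as np

def simcc_decode(x_logits, y_logits, split_ratio=2.0):
    """Decode keypoint coordinates from per-axis classification scores.

    x_logits: (K, W * split_ratio) scores over horizontal sub-pixel bins
    y_logits: (K, H * split_ratio) scores over vertical sub-pixel bins
    Returns a (K, 2) array of (x, y) coordinates in input-image pixels.
    """
    # argmax picks the most likely bin; dividing by the splitting
    # factor maps bin indices back to (sub-pixel) pixel coordinates.
    xs = np.argmax(x_logits, axis=1) / split_ratio
    ys = np.argmax(y_logits, axis=1) / split_ratio
    return np.stack([xs, ys], axis=1)

# Toy example: 2 keypoints, a 4x4 input, splitting factor 2 -> 8 bins per axis.
x_logits = np.zeros((2, 8)); x_logits[0, 5] = 1.0; x_logits[1, 2] = 1.0
y_logits = np.zeros((2, 8)); y_logits[0, 3] = 1.0; y_logits[1, 6] = 1.0
coords = simcc_decode(x_logits, y_logits)  # [[2.5, 1.5], [1.0, 3.0]]
```

Because each axis is classified independently at a finer-than-pixel granularity, localization precision is controlled by the splitting factor rather than by the resolution of an upsampled heatmap.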