FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER

Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR). In these tasks, feature map representations of the human structural information are often first extracted from the image by a CNN (such as HRNet) and then further processed by a transformer to predict the heatmaps (which encode each joint's location as a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures cannot process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. To address both problems simultaneously, we propose FeatER, a novel transformer design that preserves the inherent structure of feature map representations when modeling attention while reducing memory and computational costs. Taking advantage of FeatER, we build an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the accuracy of the estimated human pose and mesh. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer on the Human3.6M and 3DPW datasets while requiring only 5% of its Params and 16% of its MACs. The project webpage is https://zczcwh.github.io/feater_page/.
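
To make the heatmap encoding and the flattening issue concrete, the following is a minimal NumPy sketch and not FeatER's implementation. The function name `gaussian_heatmap`, the 64x48 heatmap size, and `sigma=2.0` are illustrative assumptions. The sketch encodes one joint location as a 2D Gaussian and then shows that flattening the map into a token sequence, as a standard transformer input would require, places vertically adjacent pixels a full row width apart.

```python
import numpy as np

def gaussian_heatmap(x, y, height=64, width=48, sigma=2.0):
    """Encode one joint at pixel (x, y) as a 2D Gaussian heatmap.

    The size and sigma defaults are illustrative assumptions.
    """
    xs = np.arange(width)[None, :]   # column coordinates, shape (1, W)
    ys = np.arange(height)[:, None]  # row coordinates, shape (H, 1)
    # Broadcasting yields an (H, W) map that peaks at (row=y, col=x).
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

heatmap = gaussian_heatmap(x=20, y=30)
peak_row, peak_col = np.unravel_index(heatmap.argmax(), heatmap.shape)

# Flattening to a 1D token sequence discards the explicit 2D layout:
# the pixel directly below the peak ends up `width` positions away.
tokens = heatmap.reshape(-1)
idx_peak = peak_row * heatmap.shape[1] + peak_col
idx_below = (peak_row + 1) * heatmap.shape[1] + peak_col
gap = idx_below - idx_peak  # equals the heatmap width
```

Here `gap` equals the row width (48 under these assumptions), which illustrates why flattening sits uneasily with location-sensitive representations.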
