This paper focuses on the bottom-up paradigm in multi-person pose estimation (MPPE). Most previous bottom-up methods use relations among instances to identify different body parts only during post-processing, and neglect to model relations among instances, or between instances and the environment, during feature learning. In addition, most existing works rely on upsampling and downsampling operations. These sampling steps can misalign the resampled features with the source features, which introduces deviations in the keypoint features the model learns. To overcome these limitations, we propose a convolutional neural network for bottom-up human pose estimation. It involves two basic modules: (i) The Global Relation Modeling (GRM) module globally learns relations (e.g., environmental context, inter-instance interaction) among image regions by fusing features from multiple stages during feature learning. It is combined with a spatial-channel attention mechanism, which provides adaptability along the spatial and channel dimensions. (ii) The Multi-branch Feature Align (MFA) module aggregates features from multiple branches to align the fused features and obtain a refined local keypoint representation. Our model can attend to different granularities, from local to global regions, which significantly boosts multi-person pose estimation performance. Results on the COCO and CrowdPose datasets demonstrate that it is an efficient framework for multi-person pose estimation.
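The abstract gives no implementation details for GRM or MFA, so the following is only a minimal, dependency-free sketch of the two generic ingredients it names: fusing a resampled low-resolution branch with a high-resolution branch, and gating the result along the channel and spatial dimensions. All function names are hypothetical, the attention gates have no learned weights (a real network would learn them), and nearest-neighbour upsampling is an assumed stand-in for whatever resampling the authors use. It should not be read as the authors' architecture.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Scale each channel of a [C][H][W] map by the sigmoid of its global mean.

    Assumption: parameter-free gate; a trained model would use learned layers.
    """
    out = []
    for ch in feat:
        n = len(ch) * len(ch[0])
        gate = sigmoid(sum(sum(row) for row in ch) / n)
        out.append([[v * gate for v in row] for row in ch])
    return out


def spatial_attention(feat):
    """Scale each location by the sigmoid of its mean across channels."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[feat[k][i][j] * gate[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]


def upsample_nearest(feat, scale):
    """Nearest-neighbour upsampling of a [C][H][W] map by an integer factor."""
    out = []
    for ch in feat:
        h, w = len(ch), len(ch[0])
        out.append([[ch[i // scale][j // scale] for j in range(w * scale)]
                    for i in range(h * scale)])
    return out


def fuse_branches(high_res, low_res, scale):
    """Upsample the low-resolution branch, add it to the high-resolution
    branch element-wise, then apply channel and spatial gating."""
    up = upsample_nearest(low_res, scale)
    fused = [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
             for c1, c2 in zip(high_res, up)]
    return spatial_attention(channel_attention(fused))
```

For example, `fuse_branches(high, low, 2)` with a 2-channel 4×4 `high` and a 2-channel 2×2 `low` returns a 2-channel 4×4 map. The upsampling step copies each low-resolution value into a `scale`×`scale` block, which is the kind of coarse resampling that can leave fused features offset from their source locations.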