Recently, significant improvements in the accuracy of 3D human pose estimation have been achieved by combining convolutional neural networks (CNNs) with pyramidal mesh alignment feedback loops. In parallel, Transformer-based temporal analysis architectures have driven notable advances in computer vision. Building on these developments, this study aims to deeply optimize and improve the existing PyMAF network architecture. The main contributions of this paper are: (1) introducing a Transformer feature-extraction layer based on the self-attention mechanism to strengthen the capture of low-level features; (2) improving the understanding and capture of temporal signals in video sequences through feature-level temporal fusion; and (3) applying a spatial pyramid structure for multi-scale feature fusion, effectively balancing differences in feature representation across scales. The resulting model, PyCAT4, is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.
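The self-attention mechanism underlying the proposed Transformer feature-extraction layer can be illustrated with a minimal scaled dot-product sketch in NumPy. The shapes, weight names, and random inputs below are illustrative assumptions for exposition only, not the paper's actual implementation:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of feature vectors.

    x: (seq_len, d_model) patch/frame features; w_q, w_k, w_v: projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v                                   # attention-weighted mix

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                          # 5 hypothetical patch features
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # each output feature is a weighted blend of all inputs
```

Each output vector aggregates information from every input position, which is what lets such a layer capture global context that a purely convolutional stage would only reach through stacked receptive fields.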