Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

Despite great progress in 3D human pose estimation from videos, it remains an open problem how to take full advantage of redundant 2D pose sequences to learn a representative representation for generating a single 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, for 3D human pose estimation in videos, which lifts a sequence of 2D joint locations to a 3D pose. Specifically, a vanilla Transformer encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence and aggregate information from local context, strided convolutions are incorporated into VTE to progressively reduce the sequence length. The modified VTE is termed the strided Transformer encoder (STE), and it is built upon the outputs of VTE. STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global and local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both the full-sequence scale and the single-target-frame scale, applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision and improves the representation ability of the features for the target frame. The proposed architecture is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with far fewer parameters.
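The two mechanisms described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes a 27-frame input, three strided convolutions with kernel size 3 and stride 3, a mean per-joint Euclidean error as the loss at both scales, and equal loss weights. None of these values is stated in the abstract, and all function and parameter names are hypothetical. The first part shows how strided convolutions could shrink the sequence to a single vector; the second shows one plausible form of a full-to-single objective.

```python
import math


def conv1d_out_len(length, kernel_size, stride, padding=0):
    """Standard output-length formula for a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def strided_length_schedule(num_frames, strides, kernel_size=3, padding=0):
    """Sequence length after each strided convolution (assumed settings)."""
    lengths = [num_frames]
    for stride in strides:
        lengths.append(conv1d_out_len(lengths[-1], kernel_size, stride, padding))
    return lengths


def mean_joint_error(pred, target):
    """Mean Euclidean distance over joints; a pose is a list of (x, y, z)."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def full_to_single_loss(seq_pred, seq_gt, single_pred, target_gt,
                        w_full=1.0, w_single=1.0):
    """Assumed form: weighted sum of a full-sequence term (on the VTE output)
    and a single-target-frame term (on the STE output)."""
    full = sum(mean_joint_error(p, g) for p, g in zip(seq_pred, seq_gt)) / len(seq_gt)
    single = mean_joint_error(single_pred, target_gt)
    return w_full * full + w_single * single


if __name__ == "__main__":
    # Under the assumed settings, 27 frames collapse to one representation.
    print(strided_length_schedule(27, [3, 3, 3]))
```

Under these assumptions each stage keeps one third of the frames, so the encoder layers after each reduction operate on progressively shorter sequences, which is consistent with the reduced computation cost the abstract reports.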
