TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

Estimating the 2D human pose in each view is typically the first step in calibrated multi-view 3D pose estimation. However, 2D pose detectors perform poorly in challenging situations such as occlusions and oblique viewing angles. To address this, previous works derive point-to-point correspondences between views from epipolar geometry and use them to merge prediction heatmaps or feature representations. Instead of merging or calibrating after prediction, we introduce a transformer framework for multi-view 3D pose estimation that directly improves the individual 2D predictors by integrating information from different views. Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion, to fuse cues from both the current view and neighboring views. Moreover, we propose the concept of the epipolar field to encode 3D positional information into the transformer model. The 3D position encoding guided by the epipolar field provides an efficient way of encoding correspondences between pixels of different views. Experiments on Human3.6M and Ski-Pose show that our method is more efficient than other fusion methods and improves on them consistently. Specifically, we achieve 25.8 mm MPJPE on Human3.6M with only 5M parameters at 256 × 256 resolution.
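
As background for the epipolar correspondences mentioned above, the following is a minimal NumPy sketch of the classical two-view constraint: a pixel in one calibrated view maps to a line in the other view, and the matching pixel must lie on that line. This is standard epipolar geometry, not the epipolar field proposed here, whose definition is not given in this abstract. The function names, the camera convention (X2 = R X1 + t) and all numeric values are assumptions chosen for illustration.

```python
import numpy as np


def skew(t):
    """Return the 3x3 cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def fundamental_matrix(K1, K2, R, t):
    """Fundamental matrix F with x2^T F x1 = 0.

    Assumed convention: a point X1 in camera-1 coordinates maps to
    camera-2 coordinates as X2 = R @ X1 + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)


def epipolar_line(F, x1):
    """Epipolar line (a, b, c) in view 2 for pixel x1 = (u, v) in view 1.

    The line is scaled so that (a, b) has unit length, which makes
    a*u + b*v + c a signed distance in pixels.
    """
    line = F @ np.array([x1[0], x1[1], 1.0])
    return line / np.linalg.norm(line[:2])


def point_line_distance(line, x2):
    """Distance in pixels from pixel x2 = (u, v) to a normalized line."""
    return abs(line @ np.array([x2[0], x2[1], 1.0]))


def project(K, X):
    """Pinhole projection of a 3D point X (camera coordinates) to pixels."""
    x = K @ X
    return x[:2] / x[2]
```

A short usage example with hypothetical cameras: project one 3D point into both views and check that its projection in view 2 lies on the epipolar line of its projection in view 1.

```python
K = np.array([[1000.0, 0.0, 128.0],
              [0.0, 1000.0, 128.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-1.0, 0.0, 0.2])

X1 = np.array([0.1, -0.2, 4.0])          # 3D point in camera-1 coordinates
x1 = project(K, X1)                      # pixel in view 1
x2 = project(K, R @ X1 + t)              # pixel in view 2

line = epipolar_line(fundamental_matrix(K, K, R, t), x1)
print(point_line_distance(line, x2))     # close to zero for a true correspondence
```

Heatmap-fusion methods of the kind described above restrict the search for a corresponding pixel to this line; the distance function gives a simple way to score how well a candidate pixel in the neighboring view agrees with the geometry.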
