Existing 2D-to-3D pose lifting networks perform poorly in cross-dataset benchmarks. Although 2D keypoints joined by "stick-figure" limbs have shown promise as an intermediate step, stick figures do not capture the occlusion information that is often inherent in an image. In this paper, we propose a novel representation using opaque 3D limbs that preserves occlusion information while implicitly encoding joint locations. Crucially, when the training data has accurate three-dimensional keypoints but no part-maps, this representation allows training on abstract synthetic images, with occlusion, from as many synthetic viewpoints as desired. Because poses in the real world are independent of cameras, we define a pose by limb angles rather than joint positions, which allows us to predict poses that are completely independent of camera viewpoint. The result is not only an improvement in same-dataset benchmarks but also a "quantum leap" in cross-dataset benchmarks.
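To make the idea of opaque limbs concrete, the following is a minimal sketch of how an abstract image with occlusion might be rendered from an arbitrary synthetic viewpoint. It is an illustration under stated assumptions, not the renderer used in this work: the function names `rotate_y` and `render_opaque_limbs`, the orthographic camera, the integer limb labels standing in for limb colors, and the painter's-algorithm ordering by mean limb depth are all hypothetical choices made for brevity.

```python
import numpy as np

def rotate_y(points, angle):
    """Rotate (N, 3) points about the vertical (y) axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return points @ rot.T

def render_opaque_limbs(joints, limbs, angle=0.0, size=64,
                        half_width=2.0, extent=1.0):
    """Return a (size, size) int image: 0 is background, k + 1 marks limb k.

    Assumed setup (hypothetical): orthographic camera looking along +z,
    so a smaller z value means the point is nearer to the camera.
    """
    cam = rotate_y(np.asarray(joints, dtype=float), angle)
    # Map x, y in [-extent, extent] to pixel coordinates (row index = y pixel).
    px = (cam[:, :2] / extent + 1.0) * 0.5 * (size - 1)
    ys, xs = np.mgrid[0:size, 0:size]
    grid = np.stack([xs, ys], axis=-1).astype(float)  # (size, size, 2)

    # Painter's algorithm: draw far limbs first so nearer limbs overwrite them.
    # Ordering by mean limb depth is an approximation, not a per-pixel z-buffer.
    depth = [0.5 * (cam[a, 2] + cam[b, 2]) for a, b in limbs]
    order = np.argsort(depth)[::-1]

    image = np.zeros((size, size), dtype=int)
    for k in order:
        a, b = limbs[k]
        p, q = px[a], px[b]
        seg = q - p
        denom = max(float(seg @ seg), 1e-12)
        # Distance from every pixel to the 2D segment p-q.
        t = np.clip(((grid - p) @ seg) / denom, 0.0, 1.0)
        closest = p + t[..., None] * seg
        dist = np.linalg.norm(grid - closest, axis=-1)
        image[dist <= half_width] = k + 1  # opaque: overwrite what is behind
    return image

# Two crossing limbs: limb 0 is horizontal and far, limb 1 is vertical and near.
joints = np.array([[-0.5, 0.0, 0.5], [0.5, 0.0, 0.5],
                   [0.0, -0.5, -0.5], [0.0, 0.5, -0.5]])
limbs = [(0, 1), (2, 3)]
front = render_opaque_limbs(joints, limbs, angle=0.0)
back = render_opaque_limbs(joints, limbs, angle=np.pi)
```

In this toy setup the pixel where the two limbs cross carries the label of whichever limb is nearer the camera, so `front` and `back` differ at the crossing. A stick-figure drawing of the same 2D keypoints would look the same from both sides, which is the occlusion information the opaque representation is meant to retain.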