Recent advancements in deep learning methods have significantly improved the performance of 3D Human Pose Estimation (HPE). However, performance degradation caused by domain gaps between source and target domains remains a major challenge to generalization, necessitating extensive data augmentation and/or fine-tuning for each specific target domain. To address this issue more efficiently, we propose a novel canonical domain approach that maps both the source and target domains into a unified canonical domain, alleviating the need for additional fine-tuning in the target domain. To construct the canonical domain, we introduce a canonicalization process to generate a novel canonical 2D-3D pose mapping that ensures 2D-3D pose consistency and simplifies 2D-3D pose patterns, enabling more efficient training of lifting networks. The canonicalization of both domains is achieved through the following steps: (1) in the source domain, the lifting network is trained within the canonical domain; (2) in the target domain, input 2D poses are canonicalized prior to inference by leveraging the properties of perspective projection and known camera intrinsics. Consequently, the trained network can be directly applied to the target domain without requiring additional fine-tuning. Experiments conducted with various lifting networks and publicly available datasets (e.g., Human3.6M, Fit3D, MPI-INF-3DHP) demonstrate that the proposed method substantially improves generalization capability across datasets while using the same data volume.
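The abstract states that target-domain 2D poses are canonicalized before inference by leveraging perspective projection and known camera intrinsics. A minimal sketch of one such canonicalization is shown below: pixel-space joints are mapped into normalized image-plane coordinates by applying the inverse intrinsic matrix, removing the dependence on focal length and principal point. The function name and the specific mapping are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def canonicalize_2d_pose(pose_2d, K):
    """Hypothetical canonicalization: map pixel-space 2D joints into a
    camera-independent frame using the known intrinsic matrix K.

    pose_2d: (J, 2) array of joint pixel coordinates.
    K:       (3, 3) camera intrinsic matrix.
    Returns: (J, 2) normalized image-plane coordinates.
    """
    num_joints = pose_2d.shape[0]
    # Homogeneous pixel coordinates (J, 3)
    homo = np.concatenate([pose_2d, np.ones((num_joints, 1))], axis=1)
    # Back-project through K^{-1}: removes focal length / principal point
    rays = (np.linalg.inv(K) @ homo.T).T
    return rays[:, :2]

# Example with assumed intrinsics (focal length 1000 px, principal point (500, 500))
K = np.array([[1000.0,    0.0, 500.0],
              [   0.0, 1000.0, 500.0],
              [   0.0,    0.0,   1.0]])
pose = np.array([[600.0, 700.0],   # joint right of and below the principal point
                 [500.0, 500.0]])  # joint at the principal point
canon = canonicalize_2d_pose(pose, K)
# canon[1] is the origin: the principal point maps to (0, 0) in the canonical frame
```

Because the lifting network is trained only on poses in this camera-independent frame, the same weights can be applied to a target camera with different intrinsics, which is the behavior the abstract's cross-dataset experiments evaluate.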