We introduce UPose3D, a novel approach to multi-view 3D human pose estimation that addresses challenges in accuracy and scalability. Our method improves on existing pose estimation frameworks by increasing robustness and flexibility without requiring direct 3D annotations. At its core, a pose compiler module refines the predictions of a single-image 2D keypoint estimator by leveraging temporal and cross-view information. Our novel cross-view fusion strategy scales to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D rivals methods that rely on 3D annotated data, while achieving state-of-the-art results among methods relying only on 2D supervision.