Monocular Human Pose Estimation (HPE) aims to determine the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the 2D-3D relationship is made unique by approximating it with an orthographic or weak-perspective camera model. In this study, instead of relying on such approximations, we advocate using the full perspective camera model: we estimate the camera parameters and thereby establish a precise, unambiguous 2D-3D relationship. To this end, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet exploits the full perspective camera model to estimate the 3D pose precisely in an unsupervised manner: it takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimate. These inputs are obtained from RegNet, which starts from a single image and estimates the 2D pose and the camera parameters. RegNet uses only 2D pose data as weak supervision; internally, it predicts a 3D pose, which is then projected to 2D using the estimated camera parameters, enabling RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with the camera in the loop yields better generalization to unseen data. We obtain state-of-the-art results for 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [GitHub link upon acceptance, see supplementary materials].
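A minimal sketch may clarify the distinction the abstract draws between the full perspective model and the weak-perspective approximation. The intrinsics matrix `K` and the joint coordinates below are illustrative values, not from the paper; under full perspective each joint is divided by its own depth, while weak perspective shares a single average depth across all joints, which is one source of the 2D-3D ambiguity:

```python
import numpy as np

def perspective_project(joints_3d, K):
    """Full perspective: joints_3d is (N, 3) in camera coordinates,
    K is the (3, 3) intrinsics matrix. Each joint is divided by its own depth."""
    proj = joints_3d @ K.T               # homogeneous image coordinates
    return proj[:, :2] / proj[:, 2:3]    # per-joint perspective division

def weak_perspective_project(joints_3d, K):
    """Weak-perspective approximation: all joints share one mean depth,
    so depth differences between joints are not reflected in the 2D pose."""
    z_mean = joints_3d[:, 2].mean()
    xy = joints_3d[:, :2] / z_mean
    return xy @ K[:2, :2].T + K[:2, 2]

# Illustrative intrinsics (focal length 1000 px, principal point at (500, 500)).
K = np.array([[1000.0,    0.0, 500.0],
              [   0.0, 1000.0, 500.0],
              [   0.0,    0.0,   1.0]])

# Two joints with identical (x, y) but different depths.
joints = np.array([[0.1, 0.2, 4.0],
                   [0.1, 0.2, 6.0]])

print(perspective_project(joints, K))       # the two joints project differently
print(weak_perspective_project(joints, K))  # the depth difference is lost
```

The full perspective model keeps the two joints distinguishable in 2D, whereas the weak-perspective projection maps them to the same pixel; estimating camera parameters, as LiftNet and RegNet do, is what makes the exact inverse mapping well defined.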