Learning to represent three-dimensional (3D) human pose from a two-dimensional (2D) image of a person is a challenging problem. To make the problem less ambiguous, it has become common practice to estimate 3D pose in camera coordinate space. However, this makes comparing two 3D poses difficult. In this paper, we address this challenge by separating the problem of estimating 3D pose from 2D images into two steps. We use a variational autoencoder (VAE) to find an embedding that represents 3D poses in a canonical coordinate space. We refer to this embedding as the variational view-invariant pose embedding (V-VIPE). Using V-VIPE, we can encode 2D and 3D poses and use the embedding for downstream tasks, such as retrieval and classification. We can also estimate 3D poses from these embeddings using the decoder, as well as generate unseen 3D poses. The variational nature of our encoding allows it to generalize well to unseen camera views when mapping from 2D space. To the best of our knowledge, V-VIPE is the only representation to offer this diversity of applications. Code and more information can be found at https://v-vipe.github.io/.
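The abstract describes an encode → sample → decode cycle over a VAE: a pose is mapped to a distribution in embedding space, a V-VIPE vector is sampled, and the decoder maps it back to a canonical-space 3D pose (or generates unseen poses from new samples). A minimal illustrative sketch of that cycle follows; the joint count, embedding size, and single linear layers are hypothetical stand-ins for the paper's actual networks, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 17    # hypothetical skeleton size (assumption, not from the paper)
EMBED_DIM = 32   # hypothetical V-VIPE dimensionality (assumption)

def init_linear(n_in, n_out):
    # Small random weights; enough for an illustrative forward pass.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

W_mu, b_mu = init_linear(N_JOINTS * 3, EMBED_DIM)
W_logvar, b_logvar = init_linear(N_JOINTS * 3, EMBED_DIM)
W_dec, b_dec = init_linear(EMBED_DIM, N_JOINTS * 3)

def encode(pose_3d):
    """Map a canonical-space 3D pose to the mean and log-variance of its embedding."""
    x = pose_3d.reshape(-1)
    return x @ W_mu + b_mu, x @ W_logvar + b_logvar

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) via the reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map an embedding back to a 3D pose; sampling fresh z generates unseen poses."""
    return (z @ W_dec + b_dec).reshape(N_JOINTS, 3)

# Toy canonical-space pose: encode, sample an embedding, reconstruct.
pose = rng.normal(size=(N_JOINTS, 3))
mu, logvar = encode(pose)
z = reparameterize(mu, logvar)
recon = decode(z)
```

Because two poses are compared in this canonical embedding space rather than in camera coordinates, a simple distance between their `z` vectors can serve retrieval and classification, which is the view-invariance property the abstract emphasizes.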