Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this ambiguity and commit to a single prediction. A few approaches instead generate a distribution over human meshes, enabling multiple predictions to be sampled; however, none of them is competitive with the latest single-output models when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple plausible human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in both deterministic and stochastic modes, outperforming single-output and multi-output approaches.
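To make the deterministic/stochastic distinction concrete, the following is a minimal, hypothetical sketch of iterative masked token decoding in the MaskGIT style that the abstract alludes to. It is not MEGA's actual implementation: the function name `masked_decode`, the linear unmasking schedule, and the `predict_logits` callback (standing in for the image-conditioned transformer) are all illustrative assumptions. Deterministic mode commits the argmax token at each position; stochastic mode samples from the predicted distribution, yielding diverse mesh hypotheses.

```python
import numpy as np

def masked_decode(predict_logits, num_tokens=16, vocab=512, steps=4,
                  deterministic=True, rng=None):
    """Hypothetical sketch of iterative masked decoding.

    predict_logits: callable mapping a token sequence (with MASK ids)
    to per-position logits of shape (num_tokens, vocab); in MEGA this
    role would be played by the image-conditioned network.
    """
    rng = rng or np.random.default_rng(0)
    MASK = vocab                        # reserve an extra id as the mask token
    seq = np.full(num_tokens, MASK)
    for step in range(steps):
        logits = predict_logits(seq)                     # (num_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        if deterministic:
            pred = probs.argmax(-1)                      # greedy: one mesh
        else:
            pred = np.array([rng.choice(vocab, p=p) for p in probs])  # sampled: many meshes
        conf = probs[np.arange(num_tokens), pred]
        conf[seq != MASK] = np.inf                       # never re-mask committed tokens
        # Simplified linear schedule: fix progressively more tokens per step.
        n_keep = int(np.ceil(num_tokens * (step + 1) / steps))
        keep = np.argsort(-conf)[:n_keep]                # most confident positions
        new_seq = np.full(num_tokens, MASK)
        new_seq[keep] = np.where(seq[keep] == MASK, pred[keep], seq[keep])
        seq = new_seq
    return seq
```

Because already-committed tokens are given infinite confidence, each pass only fills in the remaining masked positions; after the final step the sequence contains no mask ids and can be de-tokenized into pose and shape parameters.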