Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
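To make the two generation modes concrete, the following is a minimal sketch of iterative masked-token decoding in the general style of masked generative models. It is not MEGA's actual implementation: the abstract does not specify the network, the tokenizer, or the unmasking schedule, so everything here is an assumption made for illustration. In particular, `toy_logits` is a hypothetical stand-in for the trained image-conditioned model, `image_feature` is a single number standing in for image features, and the even, confidence-ranked unmasking schedule is one common choice rather than the authors' stated one.

```python
import math
import random

MASK = -1  # placeholder id for a token that has not been generated yet


def toy_logits(tokens, image_feature, vocab_size):
    """Hypothetical stand-in for the image-conditioned network.

    Returns one list of vocab_size scores per position. A real model would
    be a trained network; this is a fixed arithmetic function so the
    example runs without any weights.
    """
    known = sum(t for t in tokens if t != MASK)
    return [
        [math.sin(image_feature + 1.7 * pos + 0.9 * v + 0.3 * known)
         for v in range(vocab_size)]
        for pos in range(len(tokens))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate(image_feature, seq_len=8, vocab_size=16, steps=4,
             stochastic=False, rng=None):
    """Fill a fully masked token sequence over at most `steps` iterations."""
    rng = rng or random.Random()
    tokens = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = toy_logits(tokens, image_feature, vocab_size)
        probs = [softmax(row) for row in logits]
        # Propose a token (and its probability) for every masked position:
        # argmax in deterministic mode, a random draw in stochastic mode.
        proposals = {}
        for i in masked:
            if stochastic:
                choice = rng.choices(range(vocab_size), weights=probs[i])[0]
            else:
                choice = max(range(vocab_size), key=lambda v: probs[i][v])
            proposals[i] = (choice, probs[i][choice])
        # Commit an even share of the remaining positions, most confident
        # first; the last step commits everything that is still masked.
        n_commit = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens
```

Under these assumptions, `generate(0.5)` returns the same token sequence on every call, which corresponds to the single-prediction deterministic mode. Calling `generate(0.5, stochastic=True, rng=random.Random(seed))` with several seeds yields multiple candidate sequences, and the disagreement between them is one possible, rough proxy for prediction uncertainty. In a full system each token sequence would then be decoded back into a human mesh, a step this sketch omits.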