We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. Unlike previous methods that perform scene-aware mesh optimization, we propose to first estimate absolute position and dense scene contacts with a sparse 3D CNN, and then enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also makes the proposed method optimization-free and opens up the opportunity for real-time applications. Experiments show that the proposed network recovers accurate and physically plausible meshes in a single forward pass and outperforms state-of-the-art methods in both accuracy and speed.
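To make the cross-attention step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python, in which image-side tokens act as queries and scene-cue tokens act as keys and values. It is not the network described above: the function names, token shapes, and toy inputs are assumptions for illustration, and the learned query/key/value projections, multiple heads, and the sparse 3D CNN that would produce the scene cues are all omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: image-side feature tokens, shape (Nq, d)   -- hypothetical
    keys:    scene-cue tokens, shape (Nk, d)            -- hypothetical
    values:  scene-cue tokens, shape (Nk, dv)           -- hypothetical
    Returns the fused tokens (Nq, dv) and attention weights (Nq, Nk).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in queries:
        # Similarity of this image token to every scene token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted average of the scene values.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Toy inputs: 2 image tokens attend to 3 scene tokens (all values made up).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
scene_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(image_tokens, scene_tokens, scene_tokens)
```

Each row of `attn` is a probability distribution over scene tokens, so every fused token is a convex combination of scene features. This is one way geometric cues could be injected into image features without any test-time optimization.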