We introduce Structured 3D Features (S3F), a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from the surface of a parametric, statistical human mesh. The 3D points carry semantic information and can move freely in 3D space. This lets them cover the person of interest beyond the body shape itself, which in turn helps model accessories, hair, and loose clothing. Building on this representation, we present a complete 3D transformer-based attention framework that, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition. The reconstruction is produced by a single end-to-end model, trained in a semi-supervised fashion, with no additional post-processing. We show that S3F surpasses the previous state of the art on several tasks, including monocular 3D reconstruction and albedo and shading estimation. Moreover, we show that the proposed methodology supports novel view synthesis, relighting, and re-posing of the reconstruction, and extends naturally to multiple input images (e.g., different views of a person, or the same viewpoint with the person in different poses, as in a video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
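To make "pooling pixel-aligned image features onto 3D points" concrete, here is a minimal NumPy sketch of the generic operation: project each 3D point into the image with a pinhole camera and bilinearly sample a feature map at the projected location. It assumes points already expressed in camera coordinates with positive depth, a known intrinsics matrix `K`, and a single `(C, H, W)` feature map; the function names are hypothetical. This is an illustration of the general technique, not the S3F implementation, and it omits the point displacement and transformer stages entirely.

```python
import numpy as np


def project_points(points, K):
    """Pinhole projection of camera-space points (N, 3) to pixel coords (N, 2).

    Assumption: points are already in the camera frame with z > 0.
    """
    uvw = points @ K.T                 # (N, 3) homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # divide by depth


def bilinear_sample(feat, uv):
    """Bilinearly sample feat (C, H, W) at pixel coords uv (N, 2) given as (u, v).

    Coordinates are clamped to the image border. Returns an (N, C) array.
    """
    _, H, W = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du, dv = u - u0, v - v0            # fractional offsets, shape (N,)

    # Gather the four neighbouring feature vectors; each has shape (C, N).
    f00 = feat[:, v0, u0]
    f01 = feat[:, v0, u1]
    f10 = feat[:, v1, u0]
    f11 = feat[:, v1, u1]

    out = (f00 * (1 - du) * (1 - dv) + f01 * du * (1 - dv)
           + f10 * (1 - du) * dv + f11 * du * dv)
    return out.T                       # (N, C): one feature vector per point


def pool_pixel_aligned_features(feat, points, K):
    """Hypothetical helper: attach an image feature vector to every 3D point."""
    return bilinear_sample(feat, project_points(points, K))
```

For example, with `feat = np.arange(32, dtype=float).reshape(2, 4, 4)`, `K = np.eye(3)`, and a single point `[1.5, 2.0, 1.0]`, the point projects to pixel `(1.5, 2.0)` and receives the feature vector `[9.5, 25.5]`. In an autodiff framework, the same bilinear scheme makes the pooled features differentiable (almost everywhere) with respect to the point positions, which is one reason it is a common choice when points are allowed to move.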