Recent work based on convolutional encoder-decoder architectures and 3DMM parameterization has shown great potential for canonical-view reconstruction from a single input image. Conventional CNN architectures benefit from the spatial correspondence between input and output pixels. In 3D face reconstruction, however, the spatial misalignment between the input image (e.g., a face) and the canonical/UV output makes feature encoding and decoding challenging. To tackle this problem, we propose a new network architecture, namely Affine Convolution Networks, which enables CNN-based approaches to handle spatially non-corresponding input and output images while maintaining high-fidelity output. In our method, the affine convolution layer learns an affine transformation matrix for each spatial location of the feature maps. In addition, we represent 3D human heads in UV space with multiple components: diffuse maps for texture, position maps for geometry, and light maps for recovering the more complex lighting conditions found in the real world. All components can be trained without any manual annotations. Our method is non-parametric and can generate high-quality UV maps at a resolution of 512 x 512 pixels, whereas previous approaches typically generate maps of 256 x 256 pixels or smaller. Our code will be released once the paper is accepted.
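To make the idea of a per-location affine transformation concrete, the following is a minimal NumPy sketch of one possible reading: each output location applies its own 2 x 3 affine matrix to its normalized coordinates and bilinearly samples the input feature map at the resulting position. This is an illustration under stated assumptions, not the layer described in the paper. The names `per_location_affine_warp` and `identity_theta` are hypothetical, the matrices are supplied by hand rather than learned, and the normalized [-1, 1] coordinate convention is an assumption.

```python
import numpy as np


def identity_theta(height, width):
    """Hypothetical helper: one identity 2x3 affine matrix per spatial location."""
    theta = np.zeros((height, width, 2, 3))
    theta[..., 0, 0] = 1.0  # x' = x
    theta[..., 1, 1] = 1.0  # y' = y
    return theta


def per_location_affine_warp(feat, theta):
    """Warp a feature map with a separate affine matrix at every location.

    feat:  (C, H, W) feature map.
    theta: (H, W, 2, 3) affine matrices acting on normalized (x, y, 1)
           coordinates in [-1, 1] (assumed convention).
    Returns a (C, H, W) array sampled bilinearly from feat.
    """
    _, height, width = feat.shape
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)

    # Apply each location's own matrix to its own coordinate.
    src = np.einsum("hwij,hwj->hwi", theta, coords)  # (H, W, 2) as (x, y)

    # Map normalized source coordinates to pixel indices, clamped to the border.
    px = np.clip((src[..., 0] + 1.0) * 0.5 * (width - 1), 0, width - 1)
    py = np.clip((src[..., 1] + 1.0) * 0.5 * (height - 1), 0, height - 1)

    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = px - x0
    wy = py - y0

    # Bilinear interpolation between the four neighbouring pixels.
    top = feat[:, y0, x0] * (1.0 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1.0 - wx) + feat[:, y1, x1] * wx
    return top * (1.0 - wy) + bottom * wy
```

With identity matrices the warp returns the input unchanged, and negating the x-scale entry at every location mirrors the map horizontally. In a trained network the matrices would instead be predicted from the features, so that each location can pull information from a spatially distant part of the input.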