In this paper, we consider the novel problem of reconstructing a 3D human avatar from multiple unconstrained frames, without assumptions about camera calibration, capture space, or constrained actions. We address it with a framework that takes multiple unconstrained images as input and generates a shape-with-skinning avatar in the canonical space in a single feed-forward pass. To this end, we present 3D Avatar Reconstruction in the wild (ARwild), which first reconstructs implicit skinning fields in a multi-level manner; these fields are used to align and integrate the image features from multiple images in order to estimate a pixel-aligned implicit function that represents the clothed shape. To enable training and testing of the new framework, we contribute a large-scale dataset, MVP-Human (Multi-View and multi-Pose 3D Human), which contains 400 subjects, each with 15 scans in different poses and 8-view images for each pose, providing 6,000 3D scans and 48,000 images in total. Overall, benefiting from the specific network architecture and the diverse data, the trained model enables 3D avatar reconstruction from unconstrained frames and achieves state-of-the-art performance.
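As a quick consistency check on the dataset figures quoted above, the totals follow directly from the per-subject counts. The sketch below is only illustrative arithmetic; the function and variable names are hypothetical and are not part of any released MVP-Human tooling.

```python
# Per-subject counts as stated in the abstract; names are illustrative only.
NUM_SUBJECTS = 400
POSES_PER_SUBJECT = 15   # one 3D scan per pose
VIEWS_PER_POSE = 8       # rendered or captured views for each posed scan


def dataset_totals(subjects, poses_per_subject, views_per_pose):
    """Return (total_scans, total_images) implied by the per-subject counts."""
    total_scans = subjects * poses_per_subject
    total_images = total_scans * views_per_pose
    return total_scans, total_images


total_scans, total_images = dataset_totals(
    NUM_SUBJECTS, POSES_PER_SUBJECT, VIEWS_PER_POSE
)
print(total_scans, total_images)  # 6000 48000
```

The result matches the 6,000 scans and 48,000 images reported in the text.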