We present CanonicalFusion, a novel framework for reconstructing animatable human avatars from multiple images. Our central idea is to integrate the individual reconstruction results in the canonical space. Specifically, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh from the predicted depth maps. Instead of predicting high-dimensional skinning weights, we infer compressed skinning weights, i.e., a 3-dimensional vector, with the aid of pre-trained MLP networks. We also introduce a forward-skinning-based differentiable rendering scheme to merge the reconstructed results from multiple images. This scheme refines the initial mesh by reposing the canonical mesh via forward skinning and minimizing photometric and geometric errors between the rendered and predicted results. Our optimization jointly considers the positions and colors of the vertices as well as the joint angles for each image, thereby mitigating the negative effects of pose errors. Extensive experiments demonstrate the effectiveness of our method in comparison with state-of-the-art approaches. Our source code is available at https://github.com/jsshin98/CanonicalFusion.
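The reposing step at the heart of the pipeline is standard Linear Blend Skinning: each canonical-space vertex is transformed by a weighted blend of per-joint rigid transforms. The sketch below is a minimal NumPy illustration of that operation, not the paper's implementation; the function name `lbs_repose` and the dense `(V, J)` weight matrix are assumptions for illustration (the paper instead predicts compressed 3-D weight codes that an MLP decodes into full skinning weights).

```python
import numpy as np

def lbs_repose(canonical_verts, skin_weights, joint_transforms):
    """Repose canonical-space vertices via Linear Blend Skinning (LBS).

    canonical_verts:  (V, 3) vertex positions in the canonical pose
    skin_weights:     (V, J) per-vertex skinning weights; rows sum to 1
    joint_transforms: (J, 4, 4) rigid transforms taking each joint from
                      the canonical pose to the target pose
    """
    num_verts = canonical_verts.shape[0]
    # Homogeneous coordinates: (V, 4)
    homo = np.concatenate([canonical_verts, np.ones((num_verts, 1))], axis=1)
    # Blend the per-joint transforms with the skinning weights: (V, 4, 4)
    blended = np.einsum('vj,jab->vab', skin_weights, joint_transforms)
    # Apply each vertex's blended transform: (V, 4)
    posed = np.einsum('vab,vb->va', blended, homo)
    return posed[:, :3]
```

Because this forward map is differentiable in the vertex positions, the skinning weights, and the joint transforms, photometric and geometric losses on the posed, rendered mesh can be backpropagated to refine the canonical mesh and per-image joint angles, which is the role the differentiable rendering scheme plays in the abstract.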