FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Our goal is to efficiently learn, from videos, personalized animatable 3D head avatars that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being more efficient to train and render than existing approaches.
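To make the two ingredients named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes (i) deformation is a blendshape offset followed by linear blend skinning, and (ii) the specular term follows the split-sum form, with a small MLP standing in for the pre-filtered environment map and taking the reflection direction and roughness as input. All names (`deform`, `init_light_mlp`, `prefiltered_light_mlp`, `split_sum_specular`), the network size, and the choice to pass the BRDF scale and bias in as precomputed arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

def deform(canonical, blendshapes, expr, skin_weights, bone_transforms):
    """Blendshapes + linear blend skinning (illustrative shapes, assumed).

    canonical:       (V, 3) canonical vertex positions
    blendshapes:     (K, V, 3) per-vertex offsets for K expression bases
    expr:            (K,) expression coefficients
    skin_weights:    (V, J) skinning weights, rows sum to 1
    bone_transforms: (J, 4, 4) rigid transforms per joint
    """
    # Add the expression-weighted blendshape offsets to the canonical mesh.
    shaped = canonical + np.tensordot(expr, blendshapes, axes=1)          # (V, 3)
    homo = np.concatenate([shaped, np.ones((len(shaped), 1))], axis=1)    # (V, 4)
    # Blend the joint transforms per vertex, then apply them.
    per_vertex = np.tensordot(skin_weights, bone_transforms, axes=1)      # (V, 4, 4)
    posed = np.einsum("vij,vj->vi", per_vertex, homo)
    return posed[:, :3]

def init_light_mlp(rng, hidden=32):
    """Random weights for a tiny 4 -> hidden -> 3 MLP (hypothetical size)."""
    return {
        "w1": rng.normal(scale=0.5, size=(4, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.5, size=(hidden, 3)), "b2": np.zeros(3),
    }

def prefiltered_light_mlp(params, direction, roughness):
    """Stand-in for the pre-filtered environment map, conditioned on roughness."""
    x = np.concatenate([direction, roughness[:, None]], axis=1)           # (N, 4)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)                  # ReLU
    # Softplus keeps the predicted radiance non-negative.
    return np.logaddexp(0.0, h @ params["w2"] + params["b2"])             # (N, 3)

def split_sum_specular(params, normal, view, roughness, f0, brdf_scale, brdf_bias):
    """Split-sum specular: pre-filtered light times (F0 * scale + bias).

    normal, view: (N, 3) unit vectors; roughness, brdf_scale, brdf_bias: (N,)
    f0: (N, 3) specular reflectance at normal incidence.
    brdf_scale / brdf_bias are assumed to come from a precomputed BRDF lookup.
    """
    n_dot_v = np.sum(normal * view, axis=1, keepdims=True)
    reflection = 2.0 * n_dot_v * normal - view                            # mirror direction
    light = prefiltered_light_mlp(params, reflection, roughness)
    return light * (f0 * brdf_scale[:, None] + brdf_bias[:, None])
```

In this sketch the lighting network never sees an explicit environment map: roughness enters only as an extra input, which is one simple way to read "modulated by the surface roughness". A real system would train these weights through a differentiable rasterizer; here they are random and serve only to show the data flow.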
