Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack the constraints provided by multi-view information or an explicit 3D representation. As a result, image quality and realism degrade when the avatar is rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans from a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and simultaneously generates hundreds of frames from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.