GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

We present GenLCA, a diffusion-based model that generates and edits photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs and support high-fidelity facial and full-body animation. The core idea is a novel paradigm for training a full-body 3D diffusion model from partially observable 2D data, which allows the training dataset to scale to millions of real-world videos. This scale contributes to the photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that encodes unstructured video frames into structured 3D tokens. However, most real-world videos show only some body parts, so the tokens for unobserved regions suffer from excessive blurring or transparency artifacts. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset; the model inherits the photorealism and animatability of the pretrained avatar reconstruction model. Our approach makes it possible to train a diffusion model natively in 3D on large-scale real-world video data. We demonstrate the efficacy of our method through diverse, high-fidelity generation and editing results that outperform existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
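
The visibility-aware training strategy can be illustrated with a minimal sketch. The abstract specifies only that invalid regions are replaced with learnable tokens and that losses are computed over valid regions; everything else below is an assumption made for illustration, not the authors' implementation. In particular, the linear noise-to-data path with a velocity target (a common flow-matching choice), the function name `masked_flow_matching_loss`, the `predict_velocity` callable, and the fixed `mask_token` array (which would be a trained parameter in practice) are all hypothetical.

```python
import numpy as np


def masked_flow_matching_loss(x1, visible, mask_token, predict_velocity, rng):
    """Flow-matching loss restricted to visible tokens (illustrative sketch).

    x1               : (N, D) array of 3D tokens produced by the tokenizer.
    visible          : (N,) boolean array, True where a token is valid.
    mask_token       : (D,) array standing in for a learnable token (assumption).
    predict_velocity : callable (x_in, t) -> (N, D); hypothetical model.
    rng              : numpy.random.Generator used for noise and time sampling.
    """
    n, d = x1.shape
    x0 = rng.standard_normal((n, d))      # Gaussian noise sample
    t = rng.uniform()                     # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1         # assumed linear noise-to-data path
    target = x1 - x0                      # velocity of that path

    # Replace invalid tokens in the model input with the mask token,
    # so unreliable token content never reaches the model.
    x_in = np.where(visible[:, None], x_t, mask_token[None, :])
    pred = predict_velocity(x_in, t)

    # Average the per-token squared error over valid tokens only.
    per_token = ((pred - target) ** 2).mean(axis=1)
    if not visible.any():
        return 0.0
    return float(per_token[visible].mean())
```

With a model that treats each token independently, the loss in this sketch does not depend on the contents of the invalid tokens: they are swapped out of the input and excluded from the average. A real model with attention across tokens would still see the mask tokens as context, which is presumably why they are made learnable.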
