High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but models trained on it struggle to generalize to real-world data because of its limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we introduce, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained control over facial expressions and finger-level articulation, with strong identity preservation. Notably, despite the absence of direct supervision, we observe emergent relightability and loose-garment support on unconstrained inputs, as well as zero-shot robustness to stylized imagery.