Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

From arXiv. Accepted to CVPR 2026. Website: https://junxuan-li.github.io/lca

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but models trained on it struggle to generalize to real-world data because of the limited scale of the data and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we introduce, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained control over facial expressions and finger-level articulation, with strong identity preservation. Notably, we observe emergent relightability and loose-garment support that generalize to unconstrained inputs, as well as zero-shot robustness to stylized imagery, despite the absence of direct supervision.
