We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
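To illustrate how a UV parameterization can decouple the number of 3D Gaussians from the number and resolution of input images, the following is a minimal toy sketch and not the HeadsUp architecture. The pooling encoder, the linear decoder, the random weights, the constants `UV_RES`, `LATENT_DIM`, and `PARAMS_PER_GAUSSIAN`, and the function names `encode_views` and `decode_uv_gaussians` are all assumptions made for illustration. The sketch shows only the structural property: any set of views is compressed into one fixed-size latent, and the decoder emits one Gaussian per texel of a fixed UV grid anchored to template positions.

```python
import numpy as np

UV_RES = 16               # hypothetical UV grid size: UV_RES * UV_RES Gaussians
LATENT_DIM = 32           # hypothetical latent size
PARAMS_PER_GAUSSIAN = 14  # 3 offset + 3 log-scale + 4 quaternion + 1 opacity + 3 color

# Fixed random weights stand in for trained encoder/decoder parameters.
_rng = np.random.default_rng(0)
W_ENC = _rng.standard_normal((3, LATENT_DIM))
W_DEC = _rng.standard_normal((LATENT_DIM + 2, PARAMS_PER_GAUSSIAN)) * 0.01


def encode_views(views):
    """Toy encoder: pool any number of (H, W, 3) views into one latent vector."""
    # Averaging over pixels removes the dependence on image resolution.
    per_view = np.stack([v.reshape(-1, 3).mean(axis=0) for v in views])  # (V, 3)
    # Averaging over views removes the dependence on the view count.
    return per_view.mean(axis=0) @ W_ENC  # (LATENT_DIM,)


def decode_uv_gaussians(latent, template_xyz):
    """Toy decoder: one Gaussian per texel of a fixed UV grid.

    template_xyz has shape (h, w, 3) and holds the 3D position of each
    texel on a neutral head template (here supplied by the caller).
    """
    h, w, _ = template_xyz.shape
    u, v = np.meshgrid(np.linspace(0.0, 1.0, w), np.linspace(0.0, 1.0, h))
    uv = np.stack([u, v], axis=-1)                              # (h, w, 2)
    tiled = np.broadcast_to(latent, (h, w, latent.shape[0]))    # (h, w, LATENT_DIM)
    feats = np.concatenate([tiled, uv], axis=-1)                # (h, w, LATENT_DIM + 2)
    params = (feats @ W_DEC).reshape(h * w, PARAMS_PER_GAUSSIAN)

    # Bias the quaternion toward identity, then normalize to unit length.
    quat = params[:, 6:10] + np.array([1.0, 0.0, 0.0, 0.0])
    quat /= np.linalg.norm(quat, axis=1, keepdims=True)

    return {
        "position": template_xyz.reshape(-1, 3) + params[:, 0:3],  # anchored offsets
        "scale": np.exp(params[:, 3:6]),                           # strictly positive
        "rotation": quat,
        "opacity": 1.0 / (1.0 + np.exp(-params[:, 10])),           # in (0, 1)
        "color": params[:, 11:14],
    }


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    template = rng.standard_normal((UV_RES, UV_RES, 3))      # placeholder template
    few = [rng.random((32, 32, 3)) for _ in range(4)]        # 4 low-resolution views
    many = [rng.random((64, 48, 3)) for _ in range(16)]      # 16 higher-resolution views
    for views in (few, many):
        gaussians = decode_uv_gaussians(encode_views(views), template)
        print(len(views), "views ->", gaussians["position"].shape[0], "Gaussians")
```

In this sketch the Gaussian count is set by `UV_RES` alone, so adding views or raising their resolution changes only the encoder's workload and not the size of the output representation.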