We present UIKA, a feed-forward animatable Gaussian head model built from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a subject-specific model through a lengthy optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy in which each input image is associated with a pixel-wise facial correspondence estimate. This correspondence allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens to which attention can be applied at both the screen and UV levels. The learned UV tokens are decoded into canonical Gaussian attributes using UV information aggregated from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings. Project page: https://zijian-wu.github.io/uika-page/
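The abstract's UV-guided reprojection step can be illustrated with a minimal sketch, which is not the authors' implementation: given a predicted per-pixel facial correspondence map for one input view, each valid screen-space pixel color is scattered into a shared UV-space texture. Function names, tensor shapes, and the `uv_res` resolution below are illustrative assumptions.

```python
# Minimal sketch (assumed interface, not the paper's code): reproject screen-space
# pixel colors into a camera-pose- and expression-independent UV texture using a
# predicted per-pixel UV correspondence map.
import torch

def reproject_to_uv(image, uv_map, valid_mask, uv_res=256):
    """
    image:      (3, H, W) float RGB colors in screen space.
    uv_map:     (2, H, W) predicted per-pixel UV coordinates in [0, 1].
    valid_mask: (H, W) bool mask of pixels with a reliable facial correspondence.
    Returns a (3, uv_res, uv_res) UV texture and a (uv_res, uv_res) hit-count map;
    averaging such textures over views would aggregate the UV information.
    """
    colors = image.permute(1, 2, 0)[valid_mask]        # (N, 3) valid pixel colors
    uv = uv_map.permute(1, 2, 0)[valid_mask]           # (N, 2) their UV coordinates

    # Convert continuous UV coordinates to flat integer texel indices.
    texel = (uv.clamp(0, 1) * (uv_res - 1)).long()     # (N, 2)
    flat = texel[:, 1] * uv_res + texel[:, 0]          # (N,)

    # Scatter-add colors into the UV grid and average pixels that collide.
    texture = torch.zeros(uv_res * uv_res, 3)
    counts = torch.zeros(uv_res * uv_res, 1)
    texture.index_add_(0, flat, colors)
    counts.index_add_(0, flat, torch.ones(flat.shape[0], 1))
    texture = texture / counts.clamp(min=1)

    return texture.view(uv_res, uv_res, 3).permute(2, 0, 1), counts.view(uv_res, uv_res)
```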