SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

High-quality 4D head avatars reconstructed from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, and so inherit that family's domain bias. Per-subject refiners require 300K to 600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, which prevents the two regimes from sharing a representation end-to-end. To bridge the two regimes, we propose SpatialAvatar-0, built on a shared FLAME-mesh-bound Gaussian representation. Its feed-forward generator combines a parameter-free K-source mean-pool with a two-phase schedule, moving from monocular-temporal to multi-view-spatial data, that guards against the identity prior collapsing onto the smaller multi-view set. We further introduce a 10K-iteration layout-preserving per-subject refinement loop that freezes the FLAME binding and the Gaussian count and replaces densification with a three-component anti-spike regularization. In cross-domain zero-shot evaluation on VFHQ/HDTF, we surpass the in-domain leader GAGAvatar by 1.5 dB PSNR despite never training on either test domain. On the SplattingAvatar monocular benchmark, we lead on every reported metric, surpassing the 300K-iteration GeoAvatar by 1.3 dB PSNR with a per-subject schedule up to 60x shorter than those of common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
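To illustrate why a parameter-free K-source mean-pool removes the hard-coded source count, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `mean_pool_sources` and the representation of per-source features as flat lists of floats are assumptions made for illustration only; the abstract does not specify which features are pooled or at what stage.

```python
from typing import List, Sequence


def mean_pool_sources(features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into a single vector.

    Hypothetical helper: the pooling has no learned weights, so the same
    function accepts any K >= 1 and is invariant to the order of sources.
    """
    if not features:
        raise ValueError("need at least one source")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all source features must share one dimension")
    k = len(features)
    # Element-wise mean across the K sources.
    return [sum(f[i] for f in features) / k for i in range(dim)]


# The same call works for one source or several.
single = mean_pool_sources([[1.0, 2.0]])
triple = mean_pool_sources([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because the pooled output always has the same width regardless of K, anything downstream of this step would not need to change when the number of source portraits changes, which is the property the abstract attributes to the parameter-free design.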
