Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms such as VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance detail, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000× lower computational cost and a 10× smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves rendering quality comparable or superior to that of most approaches that can run only on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, running at over 180 FPS on a desktop PC and natively on-device at 24 FPS on a standalone Meta Quest 3.
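To make the phrase "multi-level wavelet spectral decomposition with low-rank structural factorization in texture space" more concrete, the following is a minimal illustrative sketch, not the method of the paper. The abstract does not specify the wavelet family, the number of levels, the rank, or how the factors are learned, so every choice here is an assumption: a Haar wavelet, a truncated SVD standing in for the low-rank factorization, and the hypothetical helper names `decompose`, `reconstruct`, `low_rank`, and `compress`. A single-channel texture is split into frequency bands, each band is replaced by a low-rank approximation, and the texture is rebuilt from the approximated bands.

```python
import numpy as np

def haar_level(x):
    """One level of an orthonormal 2D Haar transform (assumed wavelet choice)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # coarse (low-frequency) band
    lh = (a - b + c - d) / 2.0   # detail along columns
    hl = (a + b - c - d) / 2.0   # detail along rows
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, (lh, hl, hh)

def haar_level_inverse(ll, details):
    """Exact inverse of haar_level."""
    lh, hl, hh = details
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def decompose(texture, levels):
    """Multi-level decomposition: repeatedly split the coarse band."""
    bands = []
    ll = texture
    for _ in range(levels):
        ll, details = haar_level(ll)
        bands.append(details)
    return ll, bands

def reconstruct(ll, bands):
    """Rebuild the texture from the coarsest band and all detail bands."""
    for details in reversed(bands):
        ll = haar_level_inverse(ll, details)
    return ll

def low_rank(m, rank):
    """Truncated SVD, used here as a stand-in for a learned low-rank factorization."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def compress(texture, levels=2, rank=4):
    """Approximate every wavelet band by a rank-`rank` matrix, then reconstruct."""
    ll, bands = decompose(texture, levels)
    ll = low_rank(ll, rank)
    bands = [tuple(low_rank(d, rank) for d in details) for details in bands]
    return reconstruct(ll, bands)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    texture = rng.standard_normal((64, 64))  # synthetic single-channel texture
    approx = compress(texture, levels=2, rank=4)
    err = np.linalg.norm(texture - approx) / np.linalg.norm(texture)
    print(f"relative reconstruction error: {err:.3f}")
```

In this sketch, storing a rank-r band of size n × n takes 2nr numbers instead of n², which is the generic reason a factorized representation can be smaller and cheaper to evaluate. How the actual representation factorizes its blendshapes, and how the wavelet levels guide that factorization, is not described in the abstract.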