Speech-driven gestures and facial animation are fundamental to expressive digital avatars in games, virtual production, and interactive media. Existing methods, however, fall short in one of two ways: some rely on a single modality for audio-motion alignment and so fail to exploit large-scale human motion data, while others are constrained by the representational capacity and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars that processes text, audio, and motion tokens within a single formulation. By combining a spatially sparse Mixture-of-Experts framework with a temporally sparse, keyframe-centric design, UMo performs dense reconstruction in real time, producing temporally coherent, high-fidelity animation for both facial expressions and gestures. We further adopt a multi-stage training strategy with targeted audio augmentation to improve acoustic diversity and semantic consistency. As a result, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low-latency, real-time constraints, offering a practical solution for high-fidelity real-time co-speech avatars.
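To make the keyframe-centric idea concrete, the following is a minimal sketch of turning sparse keyframe poses into a dense frame sequence. It is an illustration only, not UMo's reconstruction module: the abstract does not specify how dense frames are recovered, so plain linear interpolation is used here as a stand-in, and the function name `densify_keyframes`, its parameters, and the flat list-of-floats pose format are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def densify_keyframes(
    key_times: Sequence[float],
    key_poses: Sequence[Sequence[float]],
    fps: float,
    duration: float,
) -> List[List[float]]:
    """Linearly interpolate sparse keyframe poses onto a dense frame grid.

    Assumption: linear interpolation stands in for whatever learned
    reconstruction the actual system uses. `key_times` must be strictly
    increasing and aligned one-to-one with `key_poses`.
    """
    if not key_times or len(key_times) != len(key_poses):
        raise ValueError("key_times and key_poses must be non-empty and equal in length")

    n_frames = int(round(duration * fps)) + 1
    dense: List[List[float]] = []
    for i in range(n_frames):
        t = i / fps
        # Hold the first/last keyframe outside the keyframe time range.
        if t <= key_times[0]:
            dense.append(list(key_poses[0]))
            continue
        if t >= key_times[-1]:
            dense.append(list(key_poses[-1]))
            continue
        # Find the two keyframes that bracket time t.
        hi = bisect_right(key_times, t)
        lo = hi - 1
        w = (t - key_times[lo]) / (key_times[hi] - key_times[lo])
        dense.append(
            [(1.0 - w) * a + w * b for a, b in zip(key_poses[lo], key_poses[hi])]
        )
    return dense


# Two keyframes one second apart, resampled at 4 frames per second.
frames = densify_keyframes([0.0, 1.0], [[0.0, 0.0], [1.0, 2.0]], fps=4, duration=1.0)
```

The point of the sketch is the cost structure: the expensive model only needs to emit a few keyframes per second, while the dense frames are filled in by a much cheaper step, which is one way a temporally sparse design can help meet a latency budget.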