We propose $\textbf{Soul}$, a multimodal-driven framework for high-fidelity, long-term digital human animation that generates semantically coherent videos from a single portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. To mitigate data scarcity, we construct Soul-1M, a dataset of 1 million samples finely annotated by a precise automated pipeline covering portrait, upper-body, full-body, and multi-person scenes, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio- and text-guided animation methods. The model is built on the Wan2.2-5B backbone and integrates audio-injection layers, multiple training strategies, and threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE optimize inference efficiency, yielding an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms leading open-source and commercial models in video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page: https://zhangzjn.github.io/projects/Soul/