Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, a lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which matches modality influence to the hierarchical feature-learning process: early layers prioritize text to establish action structure, while deeper layers emphasize audio to refine lip movements, preventing modality interference; (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, preserving both audio-visual alignment and the model's text-following capability. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
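The core idea of Phase-Aware Cross-Attention can be illustrated with a minimal sketch: frame queries attend to the global base-block tokens everywhere, but to a phase block's tokens only within that phase's time interval. This is an assumption-laden toy in NumPy, not the paper's implementation; the function name, the `[start, end)` interval convention, and the use of keys as values are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phase_aware_cross_attention(queries, base_tokens, phase_blocks, frame_times):
    """Toy sketch of PACA-style masked cross-attention (not the official code).

    queries:      (T, d) per-frame visual features
    base_tokens:  (Nb, d) global base-block token embeddings
    phase_blocks: list of (start, end, tokens) with tokens of shape (Np, d)
    frame_times:  (T,) normalized timestamp of each frame in [0, 1]
    """
    keys = np.concatenate([base_tokens] + [tok for _, _, tok in phase_blocks])
    T, d = queries.shape
    mask = np.zeros((T, keys.shape[0]), dtype=bool)
    mask[:, :base_tokens.shape[0]] = True  # base block is visible to every frame
    offset = base_tokens.shape[0]
    for start, end, tok in phase_blocks:
        # phase tokens are visible only to frames inside [start, end)
        active = (frame_times >= start) & (frame_times < end)
        mask[active, offset:offset + tok.shape[0]] = True
        offset += tok.shape[0]
    scores = queries @ keys.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)  # block inactive phase tokens
    return softmax(scores, axis=-1) @ keys    # values = keys, for brevity
```

The masking guarantees that a frame's output mixes only the base tokens and its own phase's tokens, which is one simple way to realize the temporal-semantic alignment the abstract describes.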