Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle to produce nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) that disentangles fine-grained emotion controls, i.e., Action Units (AUs), from audio and achieves controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs) through spatio-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism, which disentangles AUs from raw speech and effectively captures subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into a structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To enable a flexible trade-off between AU adherence and visual quality, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves superior performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR.