In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head videos that emulate such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving the generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors that lack contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that captures both behavior-grounded dynamics and linguistically driven affective semantics, promoting the contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that preserves self-audio-driven lip articulation while adaptively integrating the user's contextual behavioral cues into non-lip facial regions. These components are complemented by our designed two-stage training paradigm, which jointly enhances lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.
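To make the decoupling idea behind SDCM concrete, the following is a minimal numpy sketch, not the actual implementation: the names `sdcm_block`, `cross_attn`, the single-head unprojected attention, and the binary `lip_mask` gating are all illustrative assumptions. It shows the core contract stated above: the avatar's self-audio track attends to all facial tokens, while the user-behavior track is gated out of the lip region so it cannot interfere with lip synchronization.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(q, k, v):
    """Single-head scaled dot-product cross-attention (no projections,
    for illustration only)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def sdcm_block(face_tokens, audio_ctx, user_ctx, lip_mask):
    """Illustrative decoupled cross-attention modulation (hypothetical).

    face_tokens: (N, d) spatial tokens of the avatar's face.
    audio_ctx:   (Ma, d) self-audio features (drives lip articulation).
    user_ctx:    (Mu, d) user contextual behavior features.
    lip_mask:    (N,) 1.0 for lip-region tokens, 0.0 elsewhere.
    """
    # Audio-driven branch updates all tokens, preserving lip sync.
    audio_out = cross_attn(face_tokens, audio_ctx, audio_ctx)
    # User-behavior branch is spatially gated: zero contribution on lips.
    user_out = cross_attn(face_tokens, user_ctx, user_ctx)
    gate = (1.0 - lip_mask)[:, None]
    return face_tokens + audio_out + gate * user_out
```

Under this sketch, swapping the user context changes only non-lip tokens, which is precisely the interference-avoidance property the abstract attributes to SDCM; the real module would additionally use learned projections, multiple heads, and a soft spatial mask.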