Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints, and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency, so that it can react instantly to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500 ms), achieving a 6.8× speedup over the baseline, and produces reactive and expressive avatar motion that is preferred over the baseline in more than 80% of cases.
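The label-free preference construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Conditions` container, the `drop_user_conditions` helper, and the choice of winning sample are hypothetical, and the loss shown is the standard direct preference optimization objective on scalar log-likelihoods, which may differ from the diffusion-based variant used in Avatar Forcing.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Conditions:
    """Hypothetical container for the signals that drive the avatar."""
    avatar_audio: Sequence[float]
    user_audio: Optional[Sequence[float]]
    user_motion: Optional[Sequence[float]]


def drop_user_conditions(cond: Conditions) -> Conditions:
    """Build the conditioning for a synthetic losing sample.

    Assumption: the losing sample is motion generated while ignoring the
    user, so both user streams are removed and only the avatar's own
    audio is kept.
    """
    return Conditions(cond.avatar_audio, None, None)


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much more the trained policy favors the
    winning sample than a frozen reference model does, relative to the
    losing sample. The value of beta is illustrative only.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the winning sample would be motion that is consistent with the full user conditions, and the losing sample would be motion produced from the output of `drop_user_conditions`. No human preference labels are needed because the pair is ordered by construction.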