Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints, and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency, enabling instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework supports real-time interaction with low latency (approximately 500 ms), achieving a 6.8× speedup over the baseline, and produces reactive, expressive avatar motion that is preferred over the baseline in more than 80% of comparisons.
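To make the label-free preference construction concrete, the sketch below illustrates one plausible reading of the abstract: the "winning" sample is avatar motion generated with the user's audio/motion conditions, the synthetic "losing" sample is generated with those conditions dropped, and a diffusion-style DPO objective prefers the former relative to a frozen reference model. The denoiser, tensor shapes, noise schedule, and loss surrogate are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of label-free DPO with condition-dropped losing samples.
# All module names and shapes are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the avatar motion denoiser."""

    def __init__(self, motion_dim=64, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # Predict the noise that was added to the motion latents.
        return self.net(torch.cat([x_t, cond, t[:, None].float()], dim=-1))


def diffusion_dpo_loss(policy, reference, x_win, x_lose, cond, beta=1.0):
    """DPO on denoising error: the policy should denoise the winning
    (condition-aware) motion better than the losing (condition-dropped)
    motion, measured relative to a frozen reference model."""
    b = x_win.shape[0]
    t = torch.randint(0, 1000, (b,))
    noise = torch.randn_like(x_win)

    def err(model, x0):
        # Placeholder forward process; real code would use the noise scheduler.
        x_t = x0 + noise
        return F.mse_loss(model(x_t, t, cond), noise, reduction="none").mean(dim=-1)

    r_win = err(reference, x_win) - err(policy, x_win)    # implicit reward, winner
    r_lose = err(reference, x_lose) - err(policy, x_lose)  # implicit reward, loser
    return -F.logsigmoid(beta * (r_win - r_lose)).mean()


if __name__ == "__main__":
    policy, reference = TinyDenoiser(), TinyDenoiser()
    for p in reference.parameters():
        p.requires_grad_(False)

    cond = torch.randn(4, 128)   # user audio + motion features
    x_win = torch.randn(4, 64)   # motion generated WITH user conditions
    x_lose = torch.randn(4, 64)  # motion generated with user conditions dropped
    loss = diffusion_dpo_loss(policy, reference, x_win, x_lose, cond)
    loss.backward()
    print(float(loss))
```

In this reading, no human preference labels are needed: dropping the user conditions degrades reactivity by construction, so the condition-dropped output can serve directly as the dispreferred sample in each pair.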