Listening head generation aims to synthesize a non-verbal, responsive listener head by modeling the correlation between the speaker and the listener in a dynamic conversation. Applications of listener agents in virtual interaction have motivated many works that achieve diverse and fine-grained motion generation. However, these works can only manipulate motions through simple emotional labels and cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g., identity, personality) that users can freely customize, this limitation undermines their realism. In this paper, we propose a user-friendly framework called CustomListener to realize free-form text-prior-guided listener generation. To achieve speaker-listener coordination, we design a Static to Dynamic Portrait module (SDP), which interacts with speaker information to transform the static text into a dynamic portrait token with completed rhythm and amplitude information. To achieve coherence between segments, we design a Past Guided Generation module (PGG) that maintains the consistency of customized listener attributes through a motion prior, and we employ a diffusion-based structure conditioned on the portrait token and the motion prior to realize controllable generation. To train and evaluate our model, we construct two text-annotated listening-head datasets based on ViCo and RealTalk, which provide paired text-video labels. Extensive experiments verify the effectiveness of our model.
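To make the conditioning concrete: the abstract describes a diffusion-based generator whose denoiser is conditioned on both the portrait token (from SDP) and a motion prior from past segments (from PGG). Below is a minimal sketch of such a conditional denoiser and one DDPM-style training step, assuming a PyTorch transformer backbone; all module names, dimensions, and the noise schedule are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a listener-motion denoiser conditioned on a portrait token
# and a past-motion prior. Shapes and architecture are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalMotionDenoiser(nn.Module):
    def __init__(self, motion_dim=70, token_dim=256, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        # Fuse the two conditions (portrait token + motion prior) into one vector.
        self.cond_proj = nn.Linear(token_dim * 2, hidden)
        self.in_proj = nn.Linear(motion_dim, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, t, portrait_token, motion_prior):
        # noisy_motion: (B, T, motion_dim); t: (B,); conditions: (B, token_dim)
        cond = self.cond_proj(torch.cat([portrait_token, motion_prior], dim=-1))
        temb = self.time_embed(t.float().unsqueeze(-1))
        # Inject condition + timestep as a global bias added to every frame.
        h = self.in_proj(noisy_motion) + (cond + temb).unsqueeze(1)
        return self.out_proj(self.backbone(h))  # predict the noise (epsilon)

def diffusion_loss(model, motion, portrait_token, motion_prior, num_steps=1000):
    """One DDPM training step: noise the clean motion, predict the noise back."""
    B = motion.size(0)
    betas = torch.linspace(1e-4, 0.02, num_steps)       # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (B,))
    ab = alpha_bar[t].view(B, 1, 1)
    noise = torch.randn_like(motion)
    noisy = ab.sqrt() * motion + (1 - ab).sqrt() * noise
    return F.mse_loss(model(noisy, t, portrait_token, motion_prior), noise)
```

In this sketch the fused condition is added as a global bias to every frame embedding; per-frame conditioning or cross-attention would be equally plausible design choices for realizing the same conditioning described in the abstract.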