We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and benefits various applications, including digital humans, virtual agents, and social robots. Existing research, however, primarily focuses on talking head generation (one-way interaction); because it lacks the listening and interaction parts, it cannot produce a digital human for conversational (two-way) interaction. In this work, we construct two datasets to address this issue: ``ViCo'', for independent talking and listening head generation tasks at the sentence level, and ``ViCo-X'', for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting interaction modeling during face-to-face conversation: 1) responsive listening head generation, making listeners respond actively to the speaker with non-verbal signals; 2) expressive talking head generation, guiding speakers to be aware of listeners' behaviors; and 3) conversational head generation, integrating the talking and listening abilities in one interlocutor. Along with the datasets, we also propose baseline solutions for the three tasks. Experimental results show that our baseline method can generate responsive and vivid agents that can collaborate with a real person to carry out a whole conversation. Project page: https://vico.solutions/.