We present a framework for generating appropriate listener facial responses in dyadic social interactions based on the speaker's words. Given a transcription of the speaker's words with their timestamps, our approach autoregressively predicts the listener's response: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher-quality listener responses than training a transformer from scratch. We show through quantitative metrics and a qualitative user study that our generated listener motion is fluent and reflective of language semantics. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
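The idea of treating quantized motion elements as additional language tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary size, the codebook size, the function names, and the choice to order tokens by timestamp are all assumptions made for illustration. The sketch maps each VQ-VAE codebook index to a token id placed after the text vocabulary, then merges timestamped text tokens and motion tokens into one sequence that an autoregressive transformer could consume.

```python
from typing import List, Tuple

# Assumptions for illustration only; the abstract does not state these values.
TEXT_VOCAB_SIZE = 50257  # hypothetical size of the pre-trained text vocabulary
CODEBOOK_SIZE = 256      # hypothetical number of VQ-VAE motion codes


def motion_code_to_token(code: int) -> int:
    """Map a VQ-VAE codebook index to a token id located after the text vocabulary."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"motion code {code} outside codebook of size {CODEBOOK_SIZE}")
    return TEXT_VOCAB_SIZE + code


def token_to_motion_code(token: int) -> int:
    """Inverse mapping: recover the codebook index from an extended-vocabulary token id."""
    code = token - TEXT_VOCAB_SIZE
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"token {token} is not a motion token")
    return code


def interleave(
    words: List[Tuple[float, int]],
    motion: List[Tuple[float, int]],
) -> List[int]:
    """Merge (timestamp, text_token_id) and (timestamp, motion_code) pairs by time.

    Ordering by timestamp is an assumed scheme. When a word and a motion code
    share a timestamp, the word is placed first.
    """
    events = [(t, 0, tok) for t, tok in words]
    events += [(t, 1, motion_code_to_token(code)) for t, code in motion]
    events.sort(key=lambda e: (e[0], e[1]))
    return [tok for _, _, tok in events]


# Hypothetical inputs: two speaker word tokens and three listener motion codes.
speaker_words = [(0.0, 10), (0.5, 11)]
listener_motion = [(0.0, 3), (0.25, 7), (0.5, 1)]
sequence = interleave(speaker_words, listener_motion)
```

Under this scheme, a model initialized from text-only weights would need its embedding table extended by `CODEBOOK_SIZE` rows so the new motion token ids have embeddings; generated motion tokens can be mapped back to codebook indices with `token_to_motion_code` before decoding them with the VQ-VAE.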