In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking a first step toward an avatar chatbot system that does not rely on intermediate text. To this end, we introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus, containing approximately 9,000 dialogues (340 hours in total) recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to a given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model builds on a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through joint speech-text pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.