In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG captures natural dyadic interactions and poses new challenges in aligning generated audio with the listener's facial responses. To address these challenges, we incorporate text as an intermediate modality that bridges audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse builds on a pretrained LLM augmented with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that produces speech synchronized with the facial responses. To advance OMCRG research, we present ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
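The abstract describes Chrono-Text Markup only at a high level. The sketch below is a minimal illustration, under assumed details, of the general idea of pairing generated text tokens with timing markers so they can be aligned with facial-response frames; the marker format, frame rate, and the function name chrono_markup are hypothetical and not the paper's implementation.

# Minimal sketch (assumption): interleave timing markers with generated text
# tokens so each token can later be aligned with facial-response frames.
# The marker format, frame rate, and names here are illustrative only.

from typing import List


def chrono_markup(tokens: List[str], token_times: List[float], fps: float = 25.0) -> List[str]:
    """Pair each generated text token with a frame-index marker.

    tokens      -- text tokens emitted by the language model
    token_times -- emission time of each token in seconds (assumed known)
    fps         -- facial-frame rate used to convert times to frame indices
    """
    marked: List[str] = []
    for tok, t in zip(tokens, token_times):
        frame_idx = int(round(t * fps))   # nearest facial frame
        marked.append(f"<t{frame_idx}>")  # hypothetical timestamp marker
        marked.append(tok)
    return marked


if __name__ == "__main__":
    # Toy example: three tokens emitted at 0.0 s, 0.4 s, and 0.9 s.
    print(chrono_markup(["hi", "there", "!"], [0.0, 0.4, 0.9]))
    # -> ['<t0>', 'hi', '<t10>', 'there', '<t22>', '!']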