Visual information is central to conversation: body gestures and physical behaviour, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code.