Visual information is central to conversation: body gestures and facial expressions, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code at https://seungjuhan.me/champagne.