To facilitate research on intelligent, human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty of capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLMs) generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs best overall. However, no existing model solves all of the above challenges well, and there is still substantial room for improvement, even for LLMs with visual extensions. Our dataset is available at \url{https://ruc-aimind.github.io/projects/TikTalk/}.