To facilitate research on intelligent, human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also make it harder to capture human interests from intricate multi-modal information and generate personalized responses. Moreover, external knowledge is evoked more frequently in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that models incorporating large language models (LLMs) generate more diverse responses, while the model that uses knowledge graphs to introduce external knowledge performs best overall. However, no existing model solves all of the above challenges well. There is still considerable room for improvement, even for LLMs with visual extensions. Our dataset is available at \url{https://ruc-aimind.github.io/projects/TikTalk/}.