Art forms such as movies and television (TV) dramas are reflections of the real world and have recently attracted much attention from the multimodal learning community. However, existing corpora in this domain share three limitations: (1) annotated in a scene-oriented fashion, they ignore the coherence within plots; (2) their text lacks empathy and seldom mentions situational context; (3) their video clips are too short to cover long-form relationships. To address these issues, we constructed PTVD, the first plot-oriented multimodal dataset in the TV domain, from 1,106 TV drama episodes and 24,875 informative plot-focused sentences written by professionals, with the help of 449 human annotators. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), which enable large-scale pre-training. Next, to open-source a strong baseline for follow-up work, we developed a multimodal algorithm that tackles different cinema/TV modelling problems with a unified architecture. Extensive experiments on three cognitively inspired tasks yielded a number of novel observations (some of them quite counter-intuitive), further validating the value of PTVD in promoting multimodal research. The dataset and code are released at \url{https://ptvd.github.io/}.