Multimedia content, such as advertisements and story videos, exhibits a rich blend of creativity and multiple modalities. It combines text, visuals, audio, and storytelling techniques, and uses devices such as emotion, symbolism, and slogans to convey meaning. Previous research in multimedia understanding has focused mainly on videos depicting specific actions, such as cooking, and large annotated training datasets are scarce, which hinders the development of supervised models that perform well enough for real-world applications. Meanwhile, large language models (LLMs) have shown remarkable zero-shot performance on a range of natural language processing (NLP) tasks, such as emotion classification, question answering, and topic classification. To bridge this performance gap in multimedia understanding, we propose verbalizing story videos, that is, generating natural-language descriptions of them, and then performing video-understanding tasks on the generated story rather than on the original video. Through extensive experiments on five video-understanding tasks, we show that our method, despite being zero-shot, achieves significantly better results than supervised baselines. Further, to help address the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
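The verbalize-then-classify idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the `VideoEvidence` fields (scene captions, a speech transcript, on-screen text), the prompt wording, and the `llm` argument, which stands for any hypothetical function that maps a prompt string to a completion string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VideoEvidence:
    """Text already extracted from a video by upstream tools (hypothetical fields)."""
    scene_captions: List[str] = field(default_factory=list)
    transcript: str = ""
    on_screen_text: List[str] = field(default_factory=list)


def verbalize(evidence: VideoEvidence) -> str:
    """Merge the per-modality text into one natural-language description."""
    parts = []
    if evidence.scene_captions:
        parts.append("Scenes: " + " ".join(evidence.scene_captions))
    if evidence.transcript:
        parts.append("Speech: " + evidence.transcript)
    if evidence.on_screen_text:
        parts.append("On-screen text: " + "; ".join(evidence.on_screen_text))
    return "\n".join(parts)


def zero_shot_classify(
    story: str, labels: List[str], llm: Callable[[str], str]
) -> Optional[str]:
    """Ask a text-only model to pick one label for the verbalized video.

    `llm` is an assumed prompt-to-text callable; no specific model API is implied.
    Returns the first label found in the model's answer, or None if none matches.
    """
    prompt = (
        "Read the description of a video and answer with exactly one label.\n"
        f"Labels: {', '.join(labels)}\n"
        f"Description:\n{story}\n"
        "Label:"
    )
    answer = llm(prompt).strip().lower()
    for label in labels:
        if label.lower() in answer:
            return label
    return None
```

The task runs entirely on text, so the same description could be reused for other tasks, such as topic classification or question answering, by changing only the prompt and the label set.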