Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. However, existing public MSMO datasets suffer from several limitations, including insufficient maintenance, data inaccessibility, limited size, and a lack of proper categorization, all of which hinder effective research. To address these challenges and provide a comprehensive dataset for this new direction, we curate the MultiSum dataset. Our dataset features (1) human-validated summaries for both video and textual content, providing high-quality human instruction and labels for multimodal learning; (2) a comprehensive, carefully organized categorization spanning 17 principal categories and 170 subcategories that covers a diverse array of real-world scenarios; and (3) benchmark tests performed on the proposed dataset to assess a variety of tasks and methods, including video temporal segmentation, video summarization, text summarization, and multimodal summarization. To promote accessibility and collaboration, we release the MultiSum dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments. Our project website can be found at https://multisum-dataset.github.io/.