Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. However, existing public MSMO datasets suffer from several limitations, including insufficient maintenance, data inaccessibility, limited size, and the absence of proper categorization. To address these limitations and provide a comprehensive dataset for this new direction, we have curated the \textbf{MMSum} dataset. Our new dataset features (1) human-validated summaries for both video and textual content, providing superior human instruction and labels for multimodal learning; (2) a carefully organized categorization spanning 17 principal categories and 170 subcategories that covers a diverse array of real-world scenarios; and (3) benchmark tests performed on the proposed dataset to assess various tasks and methods, including \textit{video summarization}, \textit{text summarization}, and \textit{multimodal summarization}. To promote accessibility and collaboration, we will release the \textbf{MMSum} dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments. Our project website can be found at~\url{https://mmsum-dataset.github.io/}.