MovieSum: An Abstractive Summarization Dataset for Movie Screenplays

Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: (1) It includes movie screenplays, which are longer than scripts of TV episodes. (2) It is twice the size of previous movie screenplay datasets. (3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline.

