Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: (1) It includes movie screenplays, which are longer than scripts of TV episodes. (2) It is twice the size of previous movie screenplay datasets. (3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline.