In recent years, research in text summarization has mainly focused on the news domain, where texts are typically short and have strong layout features. The task of full-book summarization presents additional challenges which are hard to tackle with current resources, due to their limited size and availability in English only. To overcome these limitations, we present "Echoes from Alexandria", or in shortened form, "Echoes", a large resource for multilingual book summarization. Echoes features three novel datasets: i) Echo-Wiki, for multilingual book summarization, ii) Echo-XSum, for extremely-compressive multilingual book summarization, and iii) Echo-FairySum, for extractive book summarization. To the best of our knowledge, Echoes, with its thousands of books and summaries, is the largest resource, and the first to be multilingual, featuring 5 languages and 25 language pairs. In addition to Echoes, we also introduce a new extractive-then-abstractive baseline, and, supported by our experimental results and manual analysis of the summaries generated, we argue that this baseline is more suitable for book summarization than purely-abstractive approaches. We release our resource and software at https://github.com/Babelscape/echoes-from-alexandria in the hope of fostering innovative research in multilingual book summarization.