Text summarization is a popular task and an active area of research in the Natural Language Processing community. By definition, it requires accounting for long input texts, a characteristic that poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually rich layouts. This layout information is highly relevant, whether to highlight salient content or to encode long-range interactions between textual passages. Yet all publicly available summarization datasets provide only plain-text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend two popular existing English datasets (arXiv and PubMed) with layout information and propose four novel datasets, all consistently built from scholarly resources, covering French, Spanish, Portuguese, and Korean. Further, we propose new baselines that merge layout-aware and long-range models, two orthogonal approaches, and obtain state-of-the-art results, showing the importance of combining both lines of research.