Automatic document summarization aims to produce a concise summary that covers the salient information in an input document. In a report document, this information can be scattered across both textual and non-textual content. However, existing document summarization datasets and methods usually focus on the text and filter out non-textual content. Omitting tabular data can limit the informativeness of the generated summaries, especially when they need to include quantitative descriptions of critical metrics found in tables. Existing datasets and methods cannot meet the requirements of summarizing the long text and multiple tables in each report. To address the scarcity of available data, we propose FINDSum, the first large-scale dataset for long text and multi-table summarization. Built on 21,125 annual reports from 3,794 companies, it has two subsets, one for summarizing each company's results of operations and one for its liquidity. To summarize the long text and dozens of tables in each report, we present three types of summarization methods. We also propose a set of evaluation metrics to assess the use of numerical information in generated summaries. Dataset analyses and experimental results indicate the importance of jointly considering input textual and tabular data when summarizing report documents.