Text summarization is a downstream natural language processing (NLP) task that tests both the understanding and the generation capabilities of language models. Considerable progress has been made in automatically summarizing short texts such as news articles, often with satisfactory results. Summarizing long documents, however, remains a major challenge, owing to the complex contextual information in such texts and to the lack of open-source benchmark datasets and evaluation frameworks for developing and testing models. In this work, we combine ChatGPT, the latest breakthrough in the field of large language models (LLMs), with the extractive summarization model C2F-FAR (Coarse-to-Fine Facet-Aware Ranking) in a hybrid extraction and summarization pipeline for long documents such as business articles and books. We work with the world-renowned company getAbstract AG and leverage their expertise and experience in professional book summarization. A practical study shows that machine-generated summaries perform at least as well as human-written summaries when evaluated with current automated evaluation metrics. However, a closer examination of the ChatGPT-generated texts through human evaluation reveals that critical issues remain in text coherence, faithfulness, and style. Overall, our results show that ChatGPT is a very promising but not yet mature approach for summarizing long documents and can at best serve as an inspiration for human editors. We anticipate that our work will inform NLP researchers about the extent to which ChatGPT's capabilities for summarizing long documents overlap with practitioners' needs. Further work is needed to test the proposed hybrid summarization pipeline, in particular with GPT-4, and to propose a new evaluation framework tailored to the task of summarizing long documents.
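To make the two-stage idea concrete, the following is a minimal sketch of an extract-then-rewrite pipeline. It is not the authors' implementation: the frequency-based sentence scorer is only a simple stand-in for C2F-FAR, and `rewrite` is a hypothetical callable standing in for a call to ChatGPT. The function names, the prompt wording, and the parameter `k` are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_top_sentences(text: str, k: int) -> List[str]:
    """Stand-in extractive stage (NOT C2F-FAR): rank sentences by the
    average document-level frequency of their words and keep the top k."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    # Restore the original document order of the selected sentences.
    return [sentences[i] for i in sorted(ranked)]


def hybrid_summarize(text: str, k: int, rewrite: Callable[[str], str]) -> str:
    """Extract k sentences, then hand them to an abstractive rewriter.
    `rewrite` is a hypothetical hook, e.g. a wrapper around an LLM call."""
    extracted = extract_top_sentences(text, k)
    prompt = "Summarize the following key passages:\n" + "\n".join(extracted)
    return rewrite(prompt)
```

The extractive stage shortens a long document to a length the generative model can process, which is the motivation for a hybrid design. Any function that maps a prompt string to a summary string can be passed as `rewrite`.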