The increasing volume of textual data poses challenges for reading and comprehending large documents, particularly for scholars who need to extract useful information from research articles. Automatic text summarization has emerged as a powerful tool for condensing lengthy documents into concise, informative summaries. Depending on the approach used, text summarization can be categorized as either extractive or abstractive. While extractive methods are commonly used due to their simplicity, they often miss important information. Abstractive summarization, by contrast, can generate more coherent and informative summaries by capturing the underlying meaning of the text. Abstractive techniques have gained attention across many languages, and recent advances have been driven by pre-trained models such as BERT, BART, and T5. However, summarizing long documents remains challenging, and alternative models such as Longformer have been introduced to address this limitation. In this context, this paper focuses on abstractive summarization in the Persian language. The authors introduce a new dataset of 300,000 full-text Persian papers obtained from the Ensani website and apply the ARMAN model, based on the Longformer architecture, to generate summaries. The experimental results demonstrate promising performance in Persian text summarization. The paper provides a comprehensive overview of related work, describes the methodology, presents the experimental results, and concludes with directions for future research.