The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) over the past decade. However, the demands of long document analysis differ considerably from those of shorter texts, and the ever-increasing size of documents uploaded online makes automated understanding of lengthy texts a critical issue. Relevant applications include automated Web mining, legal document review, medical records analysis, financial reports analysis, contract management, environmental impact assessment, and news aggregation. Although efficient algorithms for analyzing long documents have been developed only relatively recently, practical tools in this field are currently flourishing. This article serves as an entry point into this dynamic domain and has two objectives. First, it provides an overview of the relevant neural building blocks, serving as a concise tutorial for the field. Second, it offers a brief examination of the current state of the art in long document NLP, with a primary focus on two key tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. The article thus presents an introductory exploration of document-level analysis, addressing the primary challenges, concerns, and existing solutions. Finally, it presents publicly available annotated datasets that can facilitate further research in this area.