The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) over the past decade. However, the demands of long document analysis differ considerably from those of shorter texts, and the ever-increasing size of documents uploaded online makes automated understanding of long texts a critical area of research. This article has two goals: a) it overviews the relevant neural building blocks, thus serving as a short tutorial, and b) it surveys the state of the art in long document NLP, focusing mainly on two central tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. The article therefore concerns document-level analysis. It discusses the main challenges and issues of long document NLP, along with the current solutions. Finally, the relevant publicly available annotated datasets are presented, in order to facilitate further research.