The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long document analysis are quite different from those of shorter texts, and the ever-increasing size of documents uploaded online makes automated understanding of lengthy texts a critical issue. Relevant applications include automated Web mining, legal document review, medical records analysis, financial reports analysis, contract management, environmental impact assessment, and news aggregation. Although efficient algorithms for analyzing long documents have been developed only relatively recently, practical tools in this field are currently flourishing. This article serves as an entry point into this dynamic domain and aims to achieve two objectives. First, it provides an introductory overview of the relevant neural building blocks, serving as a concise tutorial for the field. Second, it offers a brief examination of the current state of the art in two key long document analysis tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. The article thus presents an introductory exploration of document-level analysis, addressing the primary challenges, concerns, and existing solutions. Finally, it offers a concise definition of "long text/document", presents an original overarching taxonomy of common deep neural methods for long document analysis, and lists publicly available annotated datasets that can facilitate further research in this area.