Natural language processing (NLP) researchers develop models of grammar, meaning and communication based on written text. Due to task and data differences, what is considered text can vary substantially across studies. A conceptual framework for systematically capturing these differences is lacking. We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. Towards that goal, we propose common terminology to discuss the production and transformation of textual data, and introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling. We apply this taxonomy to survey existing work that extends the notion of text beyond the conservative language-centered view. We outline key desiderata and challenges of the emerging inclusive approach to text in NLP, and suggest community-level reporting as a crucial next step to consolidate the discussion.