The concept of diversity has received increasing attention in natural language processing (NLP) in recent years. It has become an advocated property of datasets and systems, and many measures are used to quantify it. However, it is often addressed in an ad hoc manner, with few explicit justifications for endorsing it and many inconsistencies across papers. There have been very few attempts to take a step back and examine how diversity is conceptualized in NLP. To address this fragmentation, we take inspiration from other scientific fields where the concept of diversity has been more thoroughly conceptualized. We build upon Stirling (2007), a unified framework adapted from ecology and economics, which distinguishes three dimensions of diversity: variety, balance, and disparity. We survey over 300 recent diversity-related papers from the ACL Anthology and build an NLP-specific framework with four perspectives: why diversity is important, what diversity is measured on, where it is measured, and how. Our analysis increases the comparability of approaches to diversity in NLP, reveals emerging trends, and allows us to formulate recommendations for the field.