Exploring and understanding language data is a fundamental stage in all areas dealing with human language. It allows NLP practitioners to uncover quality concerns and harmful biases in data before training, and helps linguists and social scientists gain insight into language use and human behavior. Yet there is currently no unified, customizable tool for seamlessly inspecting and visualizing language variation and bias across multiple variables, language units, and diverse metrics that go beyond descriptive statistics. In this paper, we introduce Variationist, a highly modular, extensible, and task-agnostic tool that fills this gap. Variationist handles, all at once, a potentially unlimited number of combinations of variable types and semantics across diversity and association metrics with regard to the language unit of choice, and orchestrates the creation of up to five-dimensional interactive charts for over 30 variable type-semantics combinations. Through case studies on computational dialectology, human label variation, and text generation, we show how Variationist enables researchers from different disciplines to effortlessly answer specific research questions or unveil undesired associations in language data. A Python library, code, documentation, and tutorials are made publicly available to the research community.