Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high proportion of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means of resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
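To make the two issues named above concrete, the following is a minimal sketch of a quality filter that drops one-line articles and exact duplicates. It is illustrative only: the function name `filter_articles` and the hash-based deduplication are assumptions for this sketch, not the paper's actual filtering pipeline, and articles are assumed to be plain-text strings.

```python
import hashlib

def filter_articles(articles):
    """Drop one-line articles and exact duplicates.

    Hypothetical helper illustrating two quality issues observed in
    non-English Wikipedia; not the paper's actual pipeline.
    `articles` is an iterable of plain-text strings, one per article.
    """
    seen_hashes = set()
    kept = []
    for text in articles:
        lines = [ln for ln in text.strip().splitlines() if ln.strip()]
        # Issue 1: one-line articles (typically stubs) carry little signal.
        if len(lines) <= 1:
            continue
        # Issue 2: exact duplicates, detected via a hash of the
        # whitespace-normalised text.
        digest = hashlib.md5(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

if __name__ == "__main__":
    corpus = [
        "A one-line stub.",            # filtered: single line
        "First line.\nSecond line.",   # kept
        "First line.\nSecond line.",   # filtered: exact duplicate
    ]
    print(len(filter_articles(corpus)))  # -> 1
```

In practice such filters would be combined with the broader heuristics the paper evaluates; this sketch covers only the two failure modes the abstract calls out explicitly.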