This paper investigates the impact of corpus creation decisions on large multilingual geographic web corpora. Beginning with a 427-billion-word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs such as New Zealand English: (i) agreement between independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each step is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus against baseline datasets. The goal is to understand the effect of upstream data cleaning decisions on downstream corpora, with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora improves with each stage of cleaning, but that this improvement is unevenly distributed across languages and populations. This result demonstrates how standard corpus creation techniques can inadvertently exclude under-represented populations.
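The three cleaning stages form a simple sequential pipeline, and a compact sketch can make the design concrete. The Python below is a minimal illustration under stated assumptions: the two language identifiers are injected as black-box callables (lid_a and lid_b are placeholders, not systems named in the paper), deduplication hashes whitespace-normalized text, and the outlier filter scores documents by their out-of-vocabulary rate against the sub-corpus's own frequent words, an illustrative proxy for location-specific detection; the vocabulary size and z-score cutoff are arbitrary choices, not the paper's settings.

```python
import hashlib
from collections import Counter
from statistics import mean, stdev
from typing import Callable, Iterable, List

def lid_agreement(docs: Iterable[str],
                  lid_a: Callable[[str], str],
                  lid_b: Callable[[str], str],
                  target_lang: str) -> List[str]:
    """Stage (i): keep documents that two independent language
    identification systems both label with the target language."""
    return [d for d in docs
            if lid_a(d) == target_lang and lid_b(d) == target_lang]

def hash_dedup(docs: Iterable[str]) -> List[str]:
    """Stage (ii): remove exact duplicates by hashing
    whitespace-normalized text."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha1(" ".join(d.split()).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def outlier_filter(docs: List[str], z_cutoff: float = 3.0) -> List[str]:
    """Stage (iii): drop documents whose lexical profile is far from
    the sub-corpus mean. The profile here is the out-of-vocabulary
    rate against the sub-corpus's own 5,000 most frequent words --
    an illustrative proxy, not the paper's exact detector."""
    if not docs:
        return []
    vocab = {w for w, _ in Counter(
        w for d in docs for w in d.split()).most_common(5000)}
    def oov(d: str) -> float:
        toks = d.split()
        return sum(t not in vocab for t in toks) / max(len(toks), 1)
    rates = [oov(d) for d in docs]
    mu = mean(rates)
    sd = stdev(rates) if len(rates) > 1 else 0.0
    return [d for d, r in zip(docs, rates)
            if sd == 0.0 or abs(r - mu) / sd <= z_cutoff]
```

Applied in order, e.g. outlier_filter(hash_dedup(lid_agreement(docs, lid_a, lid_b, "en"))), the stages mirror the staged design whose cumulative impact the evaluation measures.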
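The evaluation step relies on corpus similarity measures; one standard, lightweight choice is Spearman rank correlation over the most frequent character trigrams of two corpora, in the spirit of Kilgarriff-style corpus comparison. The sketch below uses that measure as an illustrative stand-in; the paper's actual measure, feature set, and feature count (n_features here) are not specified in the abstract and are assumptions.

```python
from collections import Counter
from typing import Iterable, List

def trigram_counts(docs: Iterable[str]) -> Counter:
    """Character-trigram frequencies over whitespace-normalized text."""
    counts = Counter()
    for d in docs:
        text = " ".join(d.split())
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    return counts

def spearman_similarity(corpus_a: List[str], corpus_b: List[str],
                        n_features: int = 1000) -> float:
    """Spearman's rho between the frequency rankings that two corpora
    assign to a shared set of frequent character trigrams. A value of
    1.0 means identical rankings; ties are broken arbitrarily, which
    is acceptable for a sketch but not for a careful replication."""
    ca, cb = trigram_counts(corpus_a), trigram_counts(corpus_b)
    # Select features by frequency in the combined corpora.
    features = [f for f, _ in (ca + cb).most_common(n_features)]
    def rank(c: Counter) -> dict:
        order = sorted(features, key=lambda f: -c[f])
        return {f: i for i, f in enumerate(order)}
    ra, rb = rank(ca), rank(cb)
    n = len(features)
    d2 = sum((ra[f] - rb[f]) ** 2 for f in features)
    return 1.0 - (6.0 * d2) / (n * (n * n - 1))
```

Comparing each cleaned sub-corpus against a baseline dataset with a measure of this kind yields the per-language and per-country validity scores the abstract describes.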