Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness, and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on the Basque language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained only on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.