This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: yes-no questions and multiple-choice questions. To this end, we draw on a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country level and the dialectal-area level. Across both question formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.