Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Large language models (LLMs) are routinely evaluated on language use, yet their explicit knowledge about linguistic structure remains poorly understood. Existing linguistic benchmarks focus on narrow phenomena, emphasize high-resource languages, and rarely test metalinguistic knowledge, that is, explicit reasoning about language structure. We present a multilingual evaluation of metalinguistic knowledge in LLMs based on the World Atlas of Language Structures (WALS), which documents 192 linguistic features across 2,660 languages. We convert WALS features into natural-language multiple-choice questions and evaluate models on the languages documented for each feature. Using accuracy and macro F1, and comparing against chance and majority-class baselines, we assess overall performance and analyse variation across linguistic domains and language-related factors. Results show limited metalinguistic knowledge: GPT-4o performs best but reaches only moderate accuracy (0.367), while open-source models lag behind. Although all models perform above chance, none outperforms the majority-class baseline, suggesting that they capture broad cross-linguistic patterns but miss fine-grained distinctions. Performance varies by domain, partly reflecting differences in online visibility. At the language level, accuracy correlates with digital language status: models answer more accurately for languages with greater digital presence and resources, and less accurately for low-resource languages. Analysis of predictive factors confirms that resource-related indicators (Wikipedia size, corpus availability) are more informative than geographic, genealogical, or sociolinguistic factors. Overall, LLM metalinguistic knowledge appears fragmented and shaped mainly by data availability rather than by broadly generalizable grammatical competence. We release the benchmark as an open-source dataset to support evaluation across languages and to encourage greater global linguistic diversity in future LLMs.
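To make the comparison against the two baselines concrete, the following is a minimal sketch of how accuracy, macro F1, a chance baseline, and a majority-class baseline could be computed for multiple-choice answers. It is not the authors' evaluation code. The function names, the toy gold and predicted labels, and the choice to pool all items into one label set are assumptions made for illustration; a real WALS-based setup would more plausibly compute the majority class separately for each feature.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the predicted option equals the gold option."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing, given options per question."""
    return sum(1 / k for k in option_counts) / len(option_counts)


def majority_predictions(gold):
    """Predict the single most frequent gold label for every item."""
    label, _ = Counter(gold).most_common(1)[0]
    return [label] * len(gold)


# Hypothetical toy data: five languages answering one word-order question
# with three answer options each (illustrative values, not from the paper).
gold = ["SOV", "SOV", "SVO", "SOV", "VSO"]
pred = ["SOV", "SVO", "SVO", "SOV", "SOV"]
majority = majority_predictions(gold)

print("model accuracy:   ", accuracy(gold, pred))
print("model macro F1:   ", round(macro_f1(gold, pred), 4))
print("chance accuracy:  ", round(chance_accuracy([3] * len(gold)), 4))
print("majority accuracy:", accuracy(gold, majority))
print("majority macro F1:", macro_f1(gold, majority))
```

In this toy case the model matches the majority baseline on accuracy while scoring higher on macro F1, since macro F1 penalizes a predictor that never outputs the rarer classes. This is one reason the two metrics are usually reported together.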
