Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.

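Translation: Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. This paper introduces MEGA, a metalinguistic evaluation benchmark covering 45 languages (spanning 12 language families and 21 linguistic areas) and 16 linguistic features across phonology, morphology, syntax, and the lexicon. We evaluate 10 LLMs (both open-source and closed-source models) on MEGA in a multiple-choice format that requires the model to select the correct language based on a description of linguistic structure. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above the chance baseline but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) predict accuracy more effectively than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release this benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.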
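The abstract reports accuracy and macro F1 against majority-class and chance baselines. The following is a minimal sketch of how these quantities relate, not the authors' evaluation code. The labels `A`, `B`, `C`, the toy gold and predicted values, and the definition of chance as a uniform guess over the answer options are all assumptions made for illustration.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def majority_baseline(gold):
    """Predict the most frequent gold label for every item."""
    label, _ = Counter(gold).most_common(1)[0]
    return [label] * len(gold)


def chance_accuracy(n_options):
    """Expected accuracy of a uniform random guess (an assumed definition)."""
    return 1.0 / n_options


# Hypothetical toy data: one feature with three possible values, skewed towards "A".
gold = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
pred = ["A", "A", "A", "B", "C", "B", "B", "A", "C", "A"]

majority = majority_baseline(gold)

print("model accuracy:   ", accuracy(gold, pred))
print("majority accuracy:", accuracy(gold, majority))
print("chance accuracy:  ", chance_accuracy(3))
print("model macro F1:   ", macro_f1(gold, pred))
print("majority macro F1:", macro_f1(gold, majority))
```

In this toy case the model beats chance on accuracy but not the majority-class baseline, while its macro F1 is higher than the baseline's, because macro F1 gives the baseline no credit on the labels it never predicts. This is one reason the two metrics are usefully reported together when feature values are unevenly distributed.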