In this work, we assess the ability of foundation models to recall encyclopedic knowledge across a wide range of linguistic contexts. To this end, we (1) produce a 20-language dataset that contains 303k factual associations paired with counterfactuals, (2) evaluate 5 models in a multilingual test, and (3) benchmark a diverse set of 24 models in an English-only test. Meta's LLaMA achieves the highest scores in both the multilingual and English-only evaluations. Yet an analysis of LLaMA's errors reveals significant limitations in its ability to recall facts in languages other than English, as well as difficulties related to the location and gender of fact subjects. Overall, our findings suggest that today's foundation models are far from polyglots.