Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, which are underrepresented on the web. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models' performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at https://github.com/juletx/BertaQA.