To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, hand-constructed by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (accuracy gaps of up to 37%, despite a random-chance baseline of 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.