Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, substantial evidence indicates that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, a benchmark consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and covers 2700+ languages for 2 of its subtasks, surpassing any existing benchmark in language coverage. We further show that 6 state-of-the-art (SOTA) models struggle on our benchmark, and discuss the factors contributing to performance, including language family, language resourcedness, task, and comprehension versus generation direction. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.