In this paper, we present LingOly, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher-difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, a 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher-resource the language, the better the scores. These results indicate that, in the absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
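The scoring scheme above, direct accuracy combined with a comparison against a no-context baseline, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the exact-match normalisation, the helper names (`exact_match`, `no_context_delta`), and the toy answers are all assumptions.

```python
# Illustrative sketch of the evaluation described above: direct exact-match
# accuracy, plus the improvement over a no-context baseline so that answers
# recoverable without the puzzle context (i.e. memorised) are penalised.
# All names and data here are hypothetical, not the paper's actual code.

def exact_match(prediction: str, gold: str) -> float:
    """Score 1.0 only if the normalised answer matches the gold answer exactly."""
    return float(prediction.strip().lower() == gold.strip().lower())

def no_context_delta(with_context: list[float], no_context: list[float]) -> float:
    """Improvement of in-context accuracy over the no-context baseline accuracy."""
    acc_ctx = sum(with_context) / len(with_context)
    acc_base = sum(no_context) / len(no_context)
    return acc_ctx - acc_base

# Toy example: three questions answered with and without the puzzle context.
ctx_scores = [exact_match(p, g) for p, g in
              [("kitabu", "kitabu"), ("mtu", "mtu"), ("nyumba", "mti")]]
base_scores = [exact_match(p, g) for p, g in
               [("kitabu", "kitabu"), ("neno", "mtu"), ("neno", "mti")]]
print(round(no_context_delta(ctx_scores, base_scores), 3))  # 0.333
```

A positive delta indicates the model genuinely used the in-context linguistic data rather than recalling the answer from pretraining, which is why the paper reports both the raw accuracy and this baseline-adjusted figure.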