Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse set of reasoning skills. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in the evaluation of broader reasoning proficiencies. One notable exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allow for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. Evaluating various models on BBEH, we observe a (harmonic) average accuracy of 9.8\% for the best general-purpose model and 44.8\% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
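The headline numbers above use a harmonic rather than arithmetic average over per-task accuracies. As a minimal sketch of why that choice matters (the task names and scores below are hypothetical, not actual BBEH results): the harmonic mean is dominated by the lowest-scoring tasks, so a model cannot mask a near-zero score on one reasoning skill with high scores elsewhere.

```python
from statistics import harmonic_mean

# Hypothetical per-task accuracies in percent (illustrative only;
# these are not real BBEH task scores).
task_accuracies = [80.0, 60.0, 5.0, 40.0]

# Arithmetic mean: the single weak task barely registers.
arith = sum(task_accuracies) / len(task_accuracies)  # 46.25

# Harmonic mean: n / sum(1/x_i), dragged sharply down by the 5.0 outlier.
harm = harmonic_mean(task_accuracies)  # ~15.74

print(f"arithmetic mean: {arith:.2f}")
print(f"harmonic mean:   {harm:.2f}")
```

Aggregating this way rewards models that are uniformly competent across all reasoning skills, which matches the benchmark's stated goal of probing general rather than specialized reasoning.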