Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, as reflected in their performance on current test tasks. This calls for a more challenging benchmark, one that can be solved only with highly advanced reasoning. In this paper, we introduce such a benchmark, consisting of 191 long-form mystery narratives (1,200 words on average) constructed as detective puzzles. The puzzles are sourced from the "5 Minute Mystery" platform, and each includes a multiple-choice question for evaluation. On average, only 47% of humans solve a puzzle successfully, while the best human solvers achieve a success rate of over 80%. We show that GPT-3 models barely outperform random guessing on this benchmark (28% accuracy), while the state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that a significant gap remains between the deep reasoning abilities of LLMs and those of humans, and it highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.