Existing benchmarks for frontier models often test specialized, ``PhD-level'' knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with ``I give up'' before providing an answer that it knows is wrong. R1 can also be remarkably ``uncertain'' in its output, and in rare cases it does not ``finish thinking,'' which suggests the need for an inference-time technique to ``wrap up'' before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.