We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems that admit deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, which localizes each error to the exact rule violated. This provides the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort scaling, where GPT-5.2 improves 81x from no reasoning to maximum effort; and (2) agentic iteration, where Claude Opus 4.6 rises from 0.3% to 30.0% through iterative checking, while GPT-5.2@xhigh improves from 20.2% to 56.0%. Agentic attempts span a median of 29 turns over 17 minutes, with the longest exceeding 1,221 turns and 14.3 hours, making the benchmark a demanding test of long-context utilization, not just reasoning.
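To make the idea of step-level verification concrete, the following is a minimal sketch and not the benchmark's actual checker. It assumes a toy puzzle whose only rules are that no value repeats within a row or a column; the names `violated_rule` and `step_rewards`, the rule labels, and the +1/-1 reward values are all hypothetical choices for illustration. Each move is applied in turn, the resulting board state is checked, and any violation is reported by rule name.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]      # 0 marks an empty cell
Move = Tuple[int, int, int]  # (row, col, value)


def violated_rule(board: Board) -> Optional[str]:
    """Return a label for the first violated rule, or None if the state is consistent.

    Toy rules (an assumption for this sketch): no repeated value in any row or column.
    """
    n = len(board)
    for r in range(n):
        filled = [v for v in board[r] if v != 0]
        if len(filled) != len(set(filled)):
            return f"row_unique (row {r})"
    for c in range(n):
        filled = [board[r][c] for r in range(n) if board[r][c] != 0]
        if len(filled) != len(set(filled)):
            return f"column_unique (column {c})"
    return None


def step_rewards(board: Board, moves: List[Move]) -> List[Tuple[float, Optional[str]]]:
    """Apply moves one at a time and score each by the consistency of the resulting state."""
    state = [row[:] for row in board]  # copy so the caller's board is untouched
    results = []
    for r, c, v in moves:
        state[r][c] = v
        rule = violated_rule(state)
        # Hypothetical reward scheme: +1 for a consistent state, -1 for a violation.
        results.append((1.0 if rule is None else -1.0, rule))
        if rule is not None:
            state[r][c] = 0  # roll back the offending move before continuing
    return results


if __name__ == "__main__":
    start = [[1, 0], [0, 0]]
    for reward, rule in step_rewards(start, [(0, 1, 2), (1, 0, 1), (1, 1, 1)]):
        print(reward, rule)
```

Note that this sketch checks only local consistency: a move can satisfy every rule so far and still disagree with the puzzle's unique solution, so a fuller reward signal would presumably also compare against the verified solution.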