We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles the accuracy of Llama-3-8B on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
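To make the verifiable-reward idea concrete, the following is a minimal, hypothetical Python sketch of how a validation program could score a candidate rule against labeled examples. The function name `verifiable_reward`, the dictionary encoding of examples, and the toy "short trains" concept are illustrative assumptions; the actual SLR validation programs are synthesized automatically from the task specification and executed on model outputs, as described above.

```python
# Illustrative sketch only (not the SLR implementation): grant a binary
# "verifiable reward" when a candidate rule classifies every labeled
# example correctly.

from typing import Callable, Dict, List

Example = Dict[str, int]          # e.g. {"length": 8} -- hypothetical encoding
Rule = Callable[[Example], bool]  # candidate hypothesis proposed by the model


def verifiable_reward(rule: Rule,
                      positives: List[Example],
                      negatives: List[Example]) -> float:
    """Return 1.0 iff the rule covers all positives and no negatives."""
    covers_all_pos = all(rule(e) for e in positives)
    covers_no_neg = not any(rule(e) for e in negatives)
    return 1.0 if covers_all_pos and covers_no_neg else 0.0


if __name__ == "__main__":
    # Toy latent ground-truth concept (assumption): "positive trains are short".
    positives = [{"length": 8}, {"length": 10}]
    negatives = [{"length": 20}, {"length": 25}]

    candidate = lambda e: e["length"] <= 10   # rule extracted from a model output
    print(verifiable_reward(candidate, positives, negatives))  # -> 1.0
```

In this sketch the reward is all-or-nothing, which mirrors the role of an executable check that either verifies a proposed rule or rejects it; graded or partial-credit rewards would be a straightforward variation.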