We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program that can be executed on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated and scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs perform better but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
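To make the idea of a validation program with verifiable rewards concrete, the following is a minimal sketch and not SLR's actual implementation. It assumes, purely for illustration, that a candidate rule can be represented as a Python predicate over a feature dictionary; the names `verify`, `ground_truth`, and `wrong_candidate`, the feature keys `a` and `b`, and the two scores returned are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, int]          # hypothetical feature dictionary
Rule = Callable[[Example], bool]  # hypothetical stand-in for a model-proposed rule


def verify(rule: Rule, labeled: List[Tuple[Example, bool]]) -> Tuple[float, float]:
    """Return (binary_reward, partial_score) for a candidate rule.

    binary_reward is 1.0 only if the rule labels every example correctly;
    partial_score is the fraction of examples labeled correctly.
    """
    correct = 0
    for example, label in labeled:
        try:
            prediction = bool(rule(example))
        except Exception:
            # A rule that crashes on an example counts as wrong on it.
            prediction = not label
        correct += prediction == label
    partial = correct / len(labeled) if labeled else 0.0
    binary = 1.0 if labeled and correct == len(labeled) else 0.0
    return (binary, partial)


def ground_truth(x: Example) -> bool:
    # Hypothetical latent rule used to label the examples.
    return x["a"] + x["b"] > 5


def wrong_candidate(x: Example) -> bool:
    # Hypothetical model output: well-formed, but not the latent rule.
    return x["b"] > 3


examples = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 5}, {"a": 0, "b": 5}]
labeled = [(x, ground_truth(x)) for x in examples]

print(verify(ground_truth, labeled))     # every example labeled correctly
print(verify(wrong_candidate, labeled))  # valid rule, but misses one example
```

The second call illustrates the gap the abstract describes: a rule can be syntactically valid and still fail logical inference, which a binary reward alone would not distinguish from a rule that is wrong everywhere. Returning a partial score alongside it is a design choice of this sketch, not something the source specifies.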