SLR: Automated Synthesis for Scalable Logical Reasoning

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program that can be executed on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated and scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs perform better but at a very high test-time compute cost, exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of the computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
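To make the three synthesized artifacts concrete, here is a minimal toy sketch in Python. It is not the SLR implementation: the `Task` container, the `make_task` and `validate` helpers, the attribute names `a` and `b`, and the accuracy-based reward are all assumptions chosen for illustration. It only shows how a hidden rule can label examples, how a prompt can be rendered from them, and how a validation program can score a candidate rule without human annotation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy representation: an example is a dict of integer attributes.
Example = Dict[str, int]
Rule = Callable[[Example], bool]


@dataclass
class Task:
    prompt: str                           # (i) instruction prompt shown to the model
    examples: List[Tuple[Example, bool]]  # labeled examples used by the validator
    ground_truth: Rule                    # (iii) latent rule, hidden from the model


def make_task(ground_truth: Rule, inputs: List[Example]) -> Task:
    """Label the inputs with the hidden rule and render a prompt from them."""
    examples = [(x, ground_truth(x)) for x in inputs]
    lines = [f"{x} -> {'positive' if y else 'negative'}" for x, y in examples]
    prompt = (
        "Find a rule that separates positive from negative examples:\n"
        + "\n".join(lines)
    )
    return Task(prompt=prompt, examples=examples, ground_truth=ground_truth)


def validate(task: Task, candidate: Rule) -> float:
    """(ii) Validation program: fraction of examples the candidate labels correctly."""
    correct = 0
    for x, y in task.examples:
        try:
            correct += int(bool(candidate(x)) == y)
        except Exception:
            # A candidate that crashes on an example earns no credit for it.
            pass
    return correct / len(task.examples)


# Nine toy inputs; the hidden rule is "a is greater than b".
inputs = [{"a": a, "b": b} for a in range(1, 4) for b in range(0, 3)]
task = make_task(lambda x: x["a"] > x["b"], inputs)

exact = validate(task, lambda x: x["a"] > x["b"])   # candidate matching the hidden rule
always_true = validate(task, lambda x: True)        # trivial candidate
```

In this toy setup the candidate that matches the hidden rule scores 1.0, while the always-true candidate is only credited for the six positive examples out of nine. The point of the design is that the score comes from executing the candidate, so it can serve as a reward signal without a human in the loop.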
