Although many benchmarks evaluate the reasoning abilities of Large Language Models (LLMs) within domains such as mathematics, coding, or data wrangling, few abstract away from domain specifics to examine reasoning as a capability in and of itself. We contribute a new type of benchmark that evaluates the inductive reasoning capabilities of LLMs: it is inspired by the forward reconstruction task from historical linguistics but is formulated in an extremely simple, general way, as a Programming by Examples (PBE) task. The task involves generating a cascade of simple string rewrite programs that transforms a given list of input strings into a list of desired output strings. We present a fully automated pipeline that programmatically generates problems of this type with controllable difficulty, enabling scalable evaluation of reasoning models while avoiding contamination. Using this approach, we construct two benchmarks: PBEBench-Lite, which efficiently stratifies models of varying capabilities, and PBEBench, which requires models to induce programs similar in complexity to those constructed by historical linguists. Our experiments reveal a substantial performance gap between models that leverage test-time compute or long chain-of-thought (LCoT) reasoning and those that do not. Moreover, although recent models show promise, their solve rates drop below 5% on hard instances of the PBEBench dataset (ground truth cascade lengths of 20 and 30), falling well short of realistic historical linguistics requirements even with popular, computationally expensive scaling techniques from the PBE and reasoning literature. Finally, using gpt-oss-120b, the best-performing open-source model, we study the effectiveness of different scaling strategies and the impact of various hyperparameters on the difficulty of the generated data.
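
To make the task concrete, the following is a minimal sketch of what a cascade of string rewrite rules might look like and how it transforms inputs into outputs. The rule format, function name, and the toy input/output pairs are illustrative assumptions, not the paper's actual DSL or data.

```python
# Illustrative sketch (hypothetical rule format): a cascade is an ordered list
# of (old, new) substring rewrites, applied in sequence to every input string.
def apply_cascade(cascade, strings):
    """Apply each rewrite rule in order to every string and return the results."""
    results = []
    for s in strings:
        for old, new in cascade:
            s = s.replace(old, new)
        results.append(s)
    return results

# Toy instance: the solver must induce a cascade mapping inputs to outputs.
inputs = ["pater", "mater", "frater"]
outputs = ["padre", "madre", "fradre"]  # hypothetical target forms

candidate = [("ter", "dre")]  # a length-1 cascade proposed by a solver
assert apply_cascade(candidate, inputs) == outputs
```

In the benchmark setting described above, the ground truth cascades are much longer (e.g., 20 to 30 rules), which is what makes the hard instances difficult to induce from examples alone.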