Eliciting reasoning capabilities from language models (LMs) is a critical direction on the path towards building intelligent systems. Most recent studies of reasoning focus on out-of-distribution performance on procedurally generated synthetic benchmarks, each built bespoke to evaluate only specific skills. This trend makes results hard to transfer across publications, which slows progress. Three years ago, a similar issue was identified and rectified in the field of neural algorithmic reasoning with the advent of the CLRS benchmark. CLRS is a dataset generator comprising graph execution traces of classical algorithms from the Introduction to Algorithms textbook. Inspired by this, we propose CLRS-Text, a textual version of these algorithmic traces. Out of the box, CLRS-Text can procedurally generate trace data for thirty diverse, challenging algorithmic tasks across any desired input distribution, and it offers a standard pipeline through which additional algorithmic tasks may be added to the benchmark. We fine-tune and evaluate various LMs as generalist executors on this benchmark, validating prior work and revealing a novel, interesting challenge for the LM reasoning community. Our code is available at https://github.com/google-deepmind/clrs/tree/master/clrs/_src/clrs_text.