Recent years have witnessed meteoric progress in reasoning models: neural networks that generate intermediate reasoning traces (RTs) before producing a final output. Despite this rapid advancement, our understanding of how RTs support reasoning, and of the limits of this paradigm, remains incomplete. To promote greater clarity, we introduce PITA: a novel large-scale dataset of over 23 million statements in propositional logic and their corresponding proofs. As a benchmark for robust reasoning, we focus on length generalization: if a model is trained to determine the truth or falsity of statements with proofs up to a fixed length, how well does it generalize to statements requiring longer proofs? We propose notions of (1) task depth and (2) task breadth, which measure respectively (1) the number of steps required to solve an example from a task and (2) the number of unique examples across a task. We vary these quantities across subsets of PITA, and find that RT models generalize well on broad, shallow subsets while deteriorating on narrow, deep subsets relative to non-RT baselines. To determine whether our results are idiosyncratic to PITA or indicative of general phenomena, we compare our results to a simple synthetic task based on syllogisms. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks. Overall, our findings identify fundamental benefits and limitations inherent in the use of reasoning traces.