Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without gold rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, in which reasoning follows a fixed, procedural strategy that is consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks stems primarily from its ability to help models internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence that reasoning patterns, rather than the quantity or quality of rationales, are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve SFT+RLVR performance comparable to that of human-annotated rationale sets 10 times larger. These results suggest that large-scale human rationale annotation can be replaced with LLM-based automatic annotation requiring only limited human supervision over reasoning patterns.