Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
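The supervision construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the reflection prompt text, the deduplication-based diversity measure, and the length cap are all assumptions introduced for clarity.

```python
def build_fused_trajectory(incorrect, correct,
                           reflection_prompt="Wait, that was wrong. Let me reconsider.",
                           max_errors=2):
    """Hypothetical sketch of TrajFusion-style supervision construction:
    interleave selected incorrect trajectories with reflection prompts,
    then append the correct trajectory.

    incorrect: list of incorrect teacher trajectories (strings)
    correct:   one correct teacher trajectory (string)
    """
    # Adaptive length control (assumed heuristic): keep only distinct
    # errors, so frequent-but-repetitive failures do not inflate the
    # fused sample, and cap the count with a budget.
    distinct_errors = list(dict.fromkeys(incorrect))
    k = min(len(distinct_errors), max_errors)

    if k == 0:
        # Uninformative error signal: reduce to vanilla RFT supervision,
        # i.e. train on the correct trajectory alone.
        return correct

    parts = []
    for err in distinct_errors[:k]:
        parts.append(err)
        parts.append(reflection_prompt)
    parts.append(correct)
    return "\n".join(parts)
```

Because the fallback path emits the correct trajectory unchanged, the construction degrades gracefully to standard RFT whenever the teacher produces no (distinct) errors, matching the behavior claimed in the abstract.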