Long-context reasoning in large language models (LLMs) has been shown to enhance their cognitive capabilities through chain-of-thought (CoT) inference. Such models are typically trained with reinforcement learning with verifiable rewards (RLVR) on reasoning problems such as math and programming. However, RLVR suffers from several bottlenecks, including the lack of dense rewards and poor sample efficiency, and therefore demands substantial compute in the post-training phase. To overcome these limitations, we propose \textbf{Semantic Soft Bootstrapping (SSB)}, a self-distillation technique in which the same base language model plays the roles of both teacher and student but receives different semantic contexts about the correctness of its outputs at training time. The model is first prompted with a math problem and several rollouts are generated. From these, the correct response and the most common incorrect response are selected and provided back to the model in context, prompting it to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data without any human intervention. The generation process also yields a sequence of teacher logits, which the student model is trained to match from the bare question alone. A minimal sketch of this logit-matching step is given below. In our experiments, we fine-tuned Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and evaluated its accuracy on the MATH500 and AIME2024 benchmarks. Our method improves accuracy by 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
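The following is a minimal, hypothetical sketch of the self-distillation step described above, assuming a standard HuggingFace transformers setup: the teacher pass sees the question together with a correct and a common incorrect rollout, the student pass sees only the bare question, and the student is trained to match the teacher's logits over the verified explanation. The prompt templates, function names, and loss details are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of SSB logit-matching self-distillation (assumed setup, not the official code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# In the paper the student is updated via parameter-efficient fine-tuning (e.g. LoRA);
# here the same weights are reused for both passes for brevity.

def teacher_logits(question, correct, wrong, explanation):
    """Teacher pass: semantic context (correct + most common incorrect rollout)
    precedes the explanation; returns logits that predict the explanation tokens."""
    prefix = tok(f"Question: {question}\nA correct solution: {correct}\n"
                 f"A common incorrect solution: {wrong}\nExplain step by step:\n",
                 return_tensors="pt").input_ids
    expl = tok(explanation, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([prefix, expl], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so shift by one to align with expl.
    return logits[:, prefix.shape[1] - 1 : -1, :], expl

def student_kl_loss(question, t_logits, expl_ids):
    """Student pass: bare question only; KL divergence between student and teacher
    distributions over the same explanation tokens."""
    prefix = tok(f"Question: {question}\nExplain step by step:\n",
                 return_tensors="pt").input_ids
    ids = torch.cat([prefix, expl_ids], dim=1)
    s_logits = model(ids).logits[:, prefix.shape[1] - 1 : -1, :]
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")
```

Because the teacher and student prompts differ only in their prefixes, the explanation spans the same number of tokens in both passes, so the two logit slices align position by position and the KL loss can be applied directly.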