Recent studies suggest that post-training on as few as $\simeq$1K reasoning trajectories can induce non-trivial reasoning capabilities in large language models. Such training corpora often contain iconic discourse tokens such as "wait", "so", and "alternatively", which appear frequently in reasoning trajectories and may play a role in this process. This paper characterizes observable token-level patterns in post-training and presents a case study of how data-efficient supervised fine-tuning (SFT) differs from, and falls short of, large-scale post-training. To this end, we first identify tokens that correlate with correct answers along reasoning trajectories across models and training setups. We then analyze the distribution and functional roles of the "wait" token, comparing a model trained in a data-efficient manner with its large-scale post-trained counterpart. We find that discourse tokens are associated with correctness and with a jump in reasoning accuracy, even under data-efficient SFT. This suggests that data-efficient SFT can partially reproduce discourse-token patterns to mimic meaningful reasoning behavior, but these patterns are less aligned with high-confidence answer transitions than those produced by large-scale post-training.