Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages an observation we term Prefix Self-Consistency (the shared initial reasoning steps across diverse solution trajectories) to enhance LLM reasoning efficiency. By training exclusively on initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model's structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.
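Below is a minimal sketch of how such prefix-only training might look in practice, assuming a HuggingFace-style causal LM. The model name, `PREFIX_LEN`, sampling settings, and the `upft_step` helper are illustrative assumptions for exposition, not details taken from the paper; the key idea shown is that each update samples a single rollout, keeps only its first few tokens, and applies standard next-token cross-entropy to that prefix with no labels or answer verification.

```python
# Hypothetical sketch of Unsupervised Prefix Fine-Tuning (UPFT).
# Assumptions: HuggingFace transformers API; model/tokenizer names,
# PREFIX_LEN, and hyperparameters are illustrative, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-7B"  # placeholder base model
PREFIX_LEN = 8                        # train on only the first few tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def upft_step(question: str) -> torch.Tensor:
    """One UPFT update: sample a single rollout, keep only its prefix,
    and fine-tune on it (no labels, no answer verification)."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    # 1. Sample ONE short trajectory; no exhaustive sampling is needed,
    #    since only the first PREFIX_LEN tokens will be trained on.
    with torch.no_grad():
        rollout = model.generate(
            prompt_ids,
            max_new_tokens=PREFIX_LEN,
            do_sample=True,
            temperature=0.7,
        )
    # 2. The training target is the prompt plus the generated prefix,
    #    with the prompt positions masked out of the loss.
    labels = rollout.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the CE loss
    # 3. Standard next-token cross-entropy, applied to the prefix only.
    loss = model(input_ids=rollout, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```

Because each question requires only one short generation of `PREFIX_LEN` tokens rather than many full solutions, this setup makes concrete how the reported reductions in sampling and training cost could arise.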