Recent approaches to generating pseudo intermediate reasoning for large language models (LLMs) have shown remarkable progress, but they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision by turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples; the classifier judges whether intermediate reasoning traces generated by an LLM are valid. An entropy-based confidence threshold then filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on verifiable math problems (an Orca-Math subset) and on question answering over image scene graphs (GQA) with visual programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and the entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.
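The filtering step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the classifier outputs a single probability that a trace is valid, that confidence is measured by the binary entropy of that probability, and that a trace is kept only when it is predicted valid and its entropy falls below a cutoff. The names `ScoredTrace`, `binary_entropy`, `select_pseudo_labels`, and the cutoff of 0.5 bits are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredTrace:
    """A generated reasoning trace plus a classifier score (hypothetical type)."""
    trace: str
    p_valid: float  # assumed: classifier's probability that the trace is valid


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_pseudo_labels(scored, max_entropy=0.5):
    """Keep traces predicted valid with entropy at or below the cutoff.

    max_entropy is an illustrative value, not one taken from the paper.
    """
    kept = []
    for s in scored:
        predicted_valid = s.p_valid >= 0.5
        confident = binary_entropy(s.p_valid) <= max_entropy
        if predicted_valid and confident:
            kept.append(s)
    return kept


if __name__ == "__main__":
    candidates = [
        ScoredTrace("trace A", 0.97),  # confident and predicted valid
        ScoredTrace("trace B", 0.60),  # predicted valid but uncertain
        ScoredTrace("trace C", 0.03),  # confident but predicted invalid
    ]
    for s in select_pseudo_labels(candidates):
        print(s.trace, round(binary_entropy(s.p_valid), 3))
```

In this sketch the kept traces would form the fine-tuning set. Lowering `max_entropy` trades coverage for label quality, which is the kind of trade-off the ablation on entropy filtering would examine.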