Large language models (LLMs) have demonstrated outstanding performance across various tasks, yet they still exhibit limitations such as hallucination, unfaithful reasoning, and toxic content. One potential approach to mitigating these issues is learning from human or external feedback (e.g., tools). In this paper, we introduce an intrinsic self-correction reasoning framework for LLMs that eliminates the need for human feedback, external tools, and handcrafted prompts. The proposed framework, based on a multi-step reasoning paradigm, \textbf{Le}arning from \textbf{Co}rrectness (\textsc{LeCo}), improves reasoning performance without needing to learn from errors. This paradigm prioritizes learning from correct reasoning steps and introduces a unique method for measuring the confidence of each reasoning step based on its generation logits. Experimental results across various multi-step reasoning tasks demonstrate that the framework improves reasoning performance with reduced token consumption.
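As a concrete illustration of the confidence measurement, below is a minimal Python sketch that scores each reasoning step by the mean log-probability of its generated tokens and identifies the least-confident step as the earliest point from which re-generation could resume. The function names and the use of a plain mean are illustrative assumptions; the score actually used by \textsc{LeCo} may combine additional signals beyond the per-token mean.

\begin{verbatim}
# Minimal sketch (assumptions): each reasoning step is represented by
# the list of log-probabilities of its generated tokens, and confidence
# is approximated by the mean token log-probability. LeCo's actual
# score may incorporate further terms.

def step_confidence(token_logprobs):
    """Hypothetical per-step confidence: mean token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def least_confident_step(steps):
    """Score every step; return (scores, index of the weakest step)."""
    scores = [step_confidence(s) for s in steps]
    return scores, min(range(len(scores)), key=scores.__getitem__)

# Toy example: three steps with made-up token log-probabilities.
steps = [[-0.1, -0.2], [-1.5, -2.0, -0.9], [-0.3, -0.4]]
scores, weakest = least_confident_step(steps)
print(scores, weakest)  # the middle step scores lowest
\end{verbatim}

Under these assumptions, the steps preceding the weakest one are treated as correct and kept, which reflects the paradigm's intuition: trust high-confidence prefixes rather than discarding the whole chain on every retry.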