Large language models (LLMs) have shown impressive capabilities but still struggle with complex reasoning tasks that require multiple steps. While prompt-based methods such as Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and their ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on the GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improving manner. The code of LaTRO is available at \url{https://github.com/SalesforceAIResearch/LaTRO}.
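The variational formulation mentioned above can be illustrated with a standard evidence lower bound (ELBO). The decomposition below is a generic sketch of treating rationales as latent variables, not the paper's exact objective; the symbols $x$ (question), $y$ (answer), $z$ (rationale), $q$ (variational rationale distribution), and $p_\theta$ (the LLM) are our illustrative notation:
\begin{equation*}
\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{z \sim q(z \mid x)}\!\left[\log p_\theta(y \mid x, z)\right] \;-\; D_{\mathrm{KL}}\!\left(q(z \mid x)\,\Vert\,p_\theta(z \mid x)\right).
\end{equation*}
Under this view, maximizing the lower bound jointly improves the rationale-generating distribution and the model's likelihood of the correct answer given a rationale, which is consistent with the abstract's claim that the same LLM both produces reasoning and evaluates its quality without an external reward model.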