Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM to modern reward-based optimization and shows that the main challenge lies in designing a sampling distribution over rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, the self-taught reasoner (STaR), and prompt posterior sampling (PPS), which keeps only the rationalization stage of STaR that conditions on the correct answer in the prompt. We experiment on LLM-as-a-judge calibration and summarization-from-feedback tasks, where conditioning on the correct answer provides strong guidance for generating rationales. Our experiments show the efficacy of PPS over the other sampling schemes and, more broadly, that the choice of sampling scheme can have a significant impact on performance.
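As a rough illustration of the latent-variable view summarized above, the sketch below writes the marginal likelihood of an answer and a filtered-EM-style surrogate objective; the symbols (question $x$, rationale $z$, correct answer $y^{*}$, rationale proposal $q$, and reward/filter $r$) are notational assumptions for this summary and may differ from the paper's exact formulation.

\[
p_\theta(y \mid x) \;=\; \sum_{z} p_\theta(y \mid z, x)\, p_\theta(z \mid x),
\qquad
\max_\theta \; \mathbb{E}_{z \sim q(\cdot \mid x, y^{*})}\!\big[\, r(x, z, y^{*}) \,\log p_\theta(z, y^{*} \mid x) \,\big].
\]

Under this reading, the three sampling schemes differ chiefly in the proposal $q$: rejection sampling draws rationales from $p_\theta(z \mid x)$ up to a fixed budget and keeps those passing the filter, STaR additionally rationalizes failed questions by conditioning on the correct answer, and PPS uses only that rationalization step, sampling $z \sim p_\theta(z \mid x, y^{*})$ with the correct answer placed in the prompt.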