We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
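As a rough sketch of the objectives the abstract refers to: writing x for the question, y for the final answer, z for a latent thinking trace, p_θ for the reasoning model, and q_φ for the variational posterior (this notation is assumed here for illustration, not taken from the paper), the single-trace ELBO, its multi-trace tightening, and a forward-KL objective for the posterior take the familiar forms:

% Illustrative sketch only; symbols x, y, z, p_\theta, q_\phi are assumed labels.
\begin{align}
\log p_\theta(y \mid x)
  &\ge \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big)
   && \text{(ELBO)}\\
\log p_\theta(y \mid x)
  &\ge \mathbb{E}_{z_{1:K}\sim q_\phi}\!\left[\log \frac{1}{K}\sum_{k=1}^{K}
     \frac{p_\theta(z_k \mid x)\,p_\theta(y \mid x, z_k)}{q_\phi(z_k \mid x, y)}\right]
   && \text{(multi-trace bound)}\\
\min_\phi\;
  &\ \mathrm{KL}\big(p_\theta(z \mid x, y)\,\|\,q_\phi(z \mid x, y)\big)
   && \text{(forward KL for the posterior)}
\end{align}

Roughly, the forward-KL direction is mass-covering and can be estimated from traces that reach the correct answer, which is the sense in which rejection-sampling finetuning and binary-reward RL are interpreted as local forward-KL objectives in the abstract.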