Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with a confidence estimate for each one, and without resorting to computationally intensive repeated sampling to reach non-modal answers. This paper describes a multi-answer reinforcement learning (RL) approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective so that models explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to baselines trained to produce a single answer. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
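
The abstract names three set-level quantities (diversity, coverage, and calibration) without defining them. The sketch below shows one plausible way such quantities could be computed for a single question. It is an illustration only, not the paper's metrics or training objective: the function names, the string normalization, and the use of a Brier score for calibration are all assumptions made here.

```python
from typing import Iterable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match.

    Assumption: exact match after normalization decides answer equality.
    """
    return " ".join(answer.lower().split())


def coverage(candidates: Sequence[str], valid_answers: Iterable[str]) -> float:
    """Fraction of the valid answers that appear among the candidates."""
    valid = {normalize(a) for a in valid_answers}
    if not valid:
        return 0.0
    proposed = {normalize(c) for c in candidates}
    return len(valid & proposed) / len(valid)


def distinct_fraction(candidates: Sequence[str]) -> float:
    """Simple diversity proxy: share of candidates that are distinct."""
    if not candidates:
        return 0.0
    return len({normalize(c) for c in candidates}) / len(candidates)


def brier_score(
    candidates: Sequence[str],
    confidences: Sequence[float],
    valid_answers: Iterable[str],
) -> float:
    """Mean squared gap between each stated confidence and 0/1 correctness.

    Lower is better; 0.0 means every confidence matched correctness exactly.
    """
    if len(candidates) != len(confidences):
        raise ValueError("need exactly one confidence per candidate")
    if not candidates:
        raise ValueError("need at least one candidate")
    valid = {normalize(a) for a in valid_answers}
    squared_errors = [
        (conf - (1.0 if normalize(cand) in valid else 0.0)) ** 2
        for cand, conf in zip(candidates, confidences)
    ]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Hypothetical model output for one ambiguous diagnostic question.
    candidates = ["Influenza", "common cold", "influenza "]
    confidences = [0.5, 0.5, 0.5]
    valid_answers = ["influenza", "COVID-19"]

    print("coverage:", coverage(candidates, valid_answers))
    print("distinct fraction:", distinct_fraction(candidates))
    print("Brier score:", brier_score(candidates, confidences, valid_answers))
```

Under these assumptions, a candidate set that repeats the same answer scores lower on the diversity proxy, and one that misses a valid answer scores lower on coverage, which is the kind of behavior a set-level objective would need to distinguish from single-answer accuracy.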
