Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as those in math and coding. However, obtaining a high-quality solution may require sampling more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first rigorously compare these two approaches and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representational power. To understand the underlying reasons, we put forward three hypotheses for this behavior: (i) parallel sampling outperforms due to the aggregation operator; (ii) sequential sampling is harmed by the need to use longer contexts; (iii) sequential sampling leads to less exploration because it conditions on previous answers. Empirical evidence across various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that aggregation and context length do not seem to be the main culprits behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that it is one main cause of the performance gap.
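To make the distinction between the two strategies concrete, the following is a minimal sketch and not the paper's actual experimental protocol. It assumes a hypothetical `Sampler` callable that receives the question and the answers produced so far. Majority voting stands in for the aggregation operator, which the abstract does not specify, and `distinct_answers` is only a crude illustrative proxy for exploration.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): a sampler takes
# the question and the previously produced answers, and returns one new answer.
Sampler = Callable[[str, Sequence[str]], str]


def parallel_sampling(sample: Sampler, question: str, n: int) -> List[str]:
    """Draw n answers independently; no draw sees any other draw."""
    return [sample(question, ()) for _ in range(n)]


def sequential_sampling(sample: Sampler, question: str, n: int) -> List[str]:
    """Draw n answers in order; each draw is conditioned on all earlier answers."""
    answers: List[str] = []
    for _ in range(n):
        # The context grows with every step, which is what hypotheses
        # (ii) longer contexts and (iii) less exploration are about.
        answers.append(sample(question, tuple(answers)))
    return answers


def majority_vote(answers: Sequence[str]) -> str:
    """One possible aggregator (assumed here): return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def distinct_answers(answers: Sequence[str]) -> int:
    """Crude exploration proxy (assumed here): number of unique answers."""
    return len(set(answers))
```

With a real model behind `sample`, comparing `distinct_answers` across the two strategies at the same budget `n` gives a rough view of how much conditioning on earlier answers narrows the set of candidates.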