Inference-time scaling (ITS) in latent reasoning models typically relies on heuristic perturbations, such as dropout or fixed Gaussian noise, to generate diverse candidate trajectories. However, we show that stronger perturbations do not necessarily yield better sampling quality: they often induce larger distribution shifts without producing more useful reasoning paths or better final decisions. A key limitation is that these perturbations inject stochasticity without defining an explicit conditional sampling distribution, making latent exploration difficult to control or optimize. To address this, we propose the Gaussian Thought Sampler (GTS), a lightweight module that reformulates latent exploration as sampling from a learned conditional distribution over continuous reasoning states. GTS predicts context-dependent perturbation distributions and is trained with GRPO-style policy optimization while keeping the backbone frozen, turning heuristic perturbation into an explicit probabilistic sampling policy. Experiments across multiple benchmarks and two latent reasoning architectures show that GTS yields more reliable inference-time scaling than heuristic baselines, suggesting that effective latent ITS requires better-controlled and optimizable sampling rather than simply amplifying stochasticity.
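To make the idea of an explicit conditional sampling distribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal Gaussian over perturbations whose mean and log standard deviation come from a hypothetical linear head (`gaussian_head`, with weights `W_mu` and `W_logsig`), and it uses group-normalized rewards as a stand-in for the GRPO-style training signal. All names, shapes, and the clipping range are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def gaussian_head(h, W_mu, W_logsig):
    """Hypothetical linear head (assumption): map a latent state h of shape (d,)
    to the mean and log-std of a diagonal Gaussian over perturbations."""
    mu = h @ W_mu
    # Clip log-std to keep the sampled noise scale in a numerically safe range.
    log_sigma = np.clip(h @ W_logsig, -5.0, 2.0)
    return mu, log_sigma


def sample_perturbed(h, mu, log_sigma, n, rng):
    """Draw n candidates h + delta with delta ~ N(mu, diag(sigma^2)).
    Also return log p(delta | h), which a policy-gradient update would need."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n, h.shape[-1]))
    delta = mu + sigma * eps
    # Diagonal Gaussian log-density, using (delta - mu) / sigma == eps.
    log_prob = -0.5 * np.sum(eps**2 + 2.0 * log_sigma + np.log(2.0 * np.pi), axis=-1)
    return h + delta, log_prob


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    h = rng.standard_normal(d)                    # stand-in for a frozen backbone's latent state
    W_mu = 0.01 * rng.standard_normal((d, d))     # hypothetical sampler parameters
    W_logsig = 0.01 * rng.standard_normal((d, d))
    mu, log_sigma = gaussian_head(h, W_mu, W_logsig)
    candidates, log_prob = sample_perturbed(h, mu, log_sigma, n=8, rng=rng)
    rewards = rng.integers(0, 2, size=8)          # placeholder correctness rewards
    adv = group_relative_advantages(rewards)
    # Surrogate objective: a gradient step on the head would increase this quantity.
    print(candidates.shape, float(np.mean(adv * log_prob)))
```

In contrast to dropout or fixed Gaussian noise, the log-density returned here is what makes the perturbation optimizable: only the sampler's parameters would receive gradients, while the backbone that produces `h` stays frozen.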