Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
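The abstract's central mechanism, predicting a context-dependent Gaussian perturbation over a continuous reasoning state and sampling via reparameterization so that a log-probability is available for GRPO-style updates, can be illustrated with a minimal numpy sketch. All class and variable names here (`GaussianThoughtSampler`, `W_mu`, `W_ls`) are illustrative assumptions, not the paper's actual implementation, and the linear heads stand in for whatever network the authors use.

```python
import numpy as np


class GaussianThoughtSampler:
    """Minimal sketch of a GTS-style head (assumed form, not the paper's code).

    Maps a latent reasoning state h to a context-dependent perturbation
    distribution N(mu(h), diag(sigma(h)^2)) and samples a perturbed latent
    via the reparameterization trick. The backbone producing h stays frozen;
    only these head parameters would be trained.
    """

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        # Small linear heads for the perturbation mean and log-std.
        self.W_mu = 0.01 * rng.standard_normal((d, d))
        self.W_ls = 0.01 * rng.standard_normal((d, d))

    def sample(self, h, rng):
        mu = h @ self.W_mu                  # context-dependent mean shift
        sigma = np.exp(h @ self.W_ls)       # context-dependent per-dim scale
        eps = rng.standard_normal(h.shape)  # reparameterization noise
        z = h + mu + sigma * eps            # perturbed latent thought
        # Diagonal-Gaussian log-density of z under N(h + mu, sigma^2),
        # the quantity a GRPO-style policy-gradient update would reweight.
        logp = -0.5 * np.sum(
            ((z - h - mu) / sigma) ** 2 + 2.0 * np.log(sigma) + np.log(2.0 * np.pi)
        )
        return z, logp


# Usage: draw several candidate trajectories from one state, as ITS would.
d = 8
gts = GaussianThoughtSampler(d)
h = np.ones(d)
rng = np.random.default_rng(1)
candidates = [gts.sample(h, rng) for _ in range(4)]
```

At inference time, each call to `sample` yields one candidate latent trajectory step plus its log-probability; unlike fixed Gaussian noise, the scale `sigma(h)` adapts to the current state, which is the structured-exploration property the abstract argues for.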