Recently, latent reasoning has been introduced into large language models (LLMs) to leverage the rich information available in a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference and fail to discover diverse reasoning paths. To bridge this gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose \textbf{\underline{L}}atent R\textbf{\underline{e}}asoning \textbf{\underline{P}}olicy \textbf{\underline{O}}ptimization~(\textbf{LEPO}), a novel framework that applies RL directly to continuous latent representations. Specifically, in the rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in the optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.
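To make the role of Gumbel-Softmax concrete, the following is a minimal sketch of one stochastic latent step, not the LEPO implementation. It assumes, purely for illustration, that a latent representation is formed as a probability-weighted mixture of token embeddings; the abstract does not specify this construction. The names `gumbel_softmax`, `latent_step`, `embedding_table`, and the temperature `tau` are hypothetical.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (continuous) sample from the categorical given by logits."""
    # Standard Gumbel(0, 1) noise via the inverse-CDF trick.
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, then apply a stable softmax.
    y = (logits + g) / tau
    y = y - y.max()
    e = np.exp(y)
    return e / e.sum()


def latent_step(logits, embedding_table, tau, rng):
    """Hypothetical latent step: mix token embeddings with sampled weights.

    Assumption (not from the source): the latent vector is the weighted
    average of the rows of embedding_table under the Gumbel-Softmax sample.
    """
    weights = gumbel_softmax(logits, tau, rng)
    return weights @ embedding_table


if __name__ == "__main__":
    logits = np.array([2.0, 1.0, 0.5, -1.0])              # toy next-token logits
    table = np.arange(12, dtype=float).reshape(4, 3)      # toy 4-token, 3-dim embeddings
    for seed in range(3):
        z = latent_step(logits, table, tau=1.0, rng=np.random.default_rng(seed))
        print(seed, z)
```

Without the noise term `g`, the same logits would always yield the same latent vector, which corresponds to the deterministic collapse described above. With the noise, repeated rollouts from identical logits produce different latent vectors, and `tau` controls how far each sample spreads its weight across tokens.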