Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic-space bottleneck: each thought must be decoded into tokens, which incurs high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths and lack a principled way to allocate probability across trajectories that differ in correctness and cost. We propose Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler to match a reward-induced posterior that accounts for both answer quality and computation cost. We instantiate this with a continuous GFlowNet using stochastic latent transitions. To handle sparse answer supervision, we introduce an Entropy-Weighted Subtrajectory Balance objective for intermediate rewards and a reference-prior regularizer to anchor exploration. Experiments under fine-tuning and transfer-learning settings show that LTF outperforms both explicit CoT and latent reasoning baselines; compared with strong latent reasoning baselines, it improves accuracy by 9.5% and reduces reasoning length by 27.2% on average.
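To make the idea of a reward-induced posterior concrete, the following toy sketch normalizes rewards over a handful of candidate trajectories so that each is sampled in proportion to its reward, rather than keeping only the best one. The reward form (answer quality multiplied by an exponential length penalty), the `cost_weight` value, and the function names are illustrative assumptions, not the formulation used by LTF, and the continuous GFlowNet sampler is replaced here by exact enumeration over a small discrete set.

```python
import math


def reward(correct: bool, length: int, cost_weight: float = 0.1) -> float:
    """Hypothetical reward: answer quality discounted by a length cost."""
    # Assumption: wrong answers keep a small nonzero reward (0.1).
    quality = 1.0 if correct else 0.1
    return quality * math.exp(-cost_weight * length)


def target_distribution(trajectories, cost_weight: float = 0.1):
    """Return p(tau) proportional to R(tau) over an enumerated candidate set."""
    rewards = [reward(c, n, cost_weight) for c, n in trajectories]
    total = sum(rewards)
    return [r / total for r in rewards]


# Toy candidates: (answer is correct, number of latent reasoning steps).
candidates = [(True, 4), (True, 12), (False, 2)]
probs = target_distribution(candidates)

for (correct, length), p in zip(candidates, probs):
    print(f"correct={correct}, steps={length}, p={p:.3f}")
```

Under these assumed rewards, the short correct trajectory receives the most probability mass, the long correct one less, and the incorrect one the least, while all three remain reachable. A reward-maximizing method would instead concentrate on the single best path.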