Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM's vocabulary embedding matrix, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL-divergence and cross-entropy (CE) losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while shortening reasoning chains by up to a factor of 4 and outperforming prior latent methods. On Math500 and AIME24, lexical-probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both a compression of a single path and a superposition of multiple paths.
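To make the superposition idea concrete, here is a minimal sketch, assuming a standard token-embedding matrix; `latent_token` and `collapse` are hypothetical names for illustration, not the paper's API. Each latent step mixes vocabulary embeddings weighted by their predicted probabilities, so the latent token stays in the span of the vocabulary embeddings, and collapsing commits to a single explicit token:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the vocabulary-superposition idea, not the paper's code.
# Hypothetical sizes: V = vocabulary size, d = hidden dimension.
V, d = 32_000, 4_096
embedding_matrix = torch.randn(V, d)  # row v = embedding of token v

def latent_token(vocab_logits: torch.Tensor) -> torch.Tensor:
    """A latent token as a probability-weighted superposition of token embeddings.

    The result is a convex combination of vocabulary embeddings, so it lies in
    their span rather than in an unstructured latent space.
    """
    probs = F.softmax(vocab_logits, dim=-1)  # (V,) distribution over the vocabulary
    return probs @ embedding_matrix          # (d,) superposed latent embedding

def collapse(vocab_logits: torch.Tensor) -> int:
    """Collapse the superposition into a single explicit token (an 'eigenstate')."""
    return int(torch.argmax(vocab_logits))

logits = torch.randn(V)
z = latent_token(logits)  # fed back as the next input embedding during latent reasoning
tok = collapse(logits)    # emitted once latent reasoning concludes
```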
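The stage-2 objective (KL plus CE) might look like the following sketch, under the assumption that the stage-1 encoder supplies target vocabulary distributions at latent positions while gold token ids supervise the explicit answer span; every tensor name and shape here is illustrative:

```python
import torch
import torch.nn.functional as F

def latent_sft_stage2_loss(
    student_logits: torch.Tensor,  # (T, V) LLM logits at every sequence position
    teacher_probs: torch.Tensor,   # (T_lat, V) encoder target distributions at latent positions
    latent_mask: torch.Tensor,     # (T,) bool, True where a latent token is predicted
    answer_ids: torch.Tensor,      # (T_ans,) gold token ids for the explicit answer span
    answer_mask: torch.Tensor,     # (T,) bool, True where an explicit answer token is predicted
    kl_weight: float = 1.0,
) -> torch.Tensor:
    """KL on latent positions plus CE on explicit answer positions (illustrative)."""
    log_probs = F.log_softmax(student_logits, dim=-1)

    # KL divergence between the encoder's vocabulary distributions and the
    # LLM's predicted distributions at latent-token positions.
    kl = F.kl_div(log_probs[latent_mask], teacher_probs, reduction="batchmean")

    # Standard next-token cross-entropy on the explicit answer tokens.
    ce = F.cross_entropy(student_logits[answer_mask], answer_ids)

    return kl_weight * kl + ce
```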