Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next-compressed-embedding prediction objective. This process merges the embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions over subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
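To make the compression step described above concrete, the following is a minimal sketch (not the authors' released implementation): it merges every r consecutive token embeddings into one compressed embedding and uses a latent head to sample the next compressed embedding. The mean-pooling merge, the Gaussian parameterization of the latent head, and all tensor shapes are assumptions for illustration only.

```python
# Hypothetical sketch of CoLaR-style embedding compression; mean pooling,
# Gaussian latent head, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


def compress_embeddings(token_embeds: torch.Tensor, r: int) -> torch.Tensor:
    """Merge every r consecutive token embeddings into one compressed embedding.

    token_embeds: (batch, seq_len, hidden); seq_len is truncated to a multiple of r.
    """
    b, t, h = token_embeds.shape
    t = (t // r) * r
    return token_embeds[:, :t].reshape(b, t // r, r, h).mean(dim=2)


class LatentHead(nn.Module):
    """Predicts a distribution over the next compressed embedding (assumed Gaussian).

    Sampling from this distribution provides the non-determinism that the RL stage
    can exploit to explore diverse latent reasoning paths.
    """

    def __init__(self, hidden: int):
        super().__init__()
        self.mean = nn.Linear(hidden, hidden)
        self.log_std = nn.Linear(hidden, hidden)

    def forward(self, h_last: torch.Tensor) -> torch.Tensor:
        mu, log_std = self.mean(h_last), self.log_std(h_last)
        return mu + torch.randn_like(mu) * log_std.exp()  # reparameterized sample


# Example: compress 12 reasoning-token embeddings with factor r=3,
# then sample the next latent reasoning step from the last compressed state.
embeds = torch.randn(1, 12, 768)
compressed = compress_embeddings(embeds, r=3)      # (1, 4, 768)
next_latent = LatentHead(768)(compressed[:, -1])   # (1, 768)
```

In this sketch, varying r at inference time corresponds to prompting the desired compression factor, trading reasoning-chain length against fidelity.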