Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Large Language Models (LLMs) achieve strong performance through Chain-of-Thought (CoT) reasoning, but token-level reasoning chains are computationally expensive. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends next-token prediction with an auxiliary next-compressed-embedding prediction objective. This objective merges the embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict the distribution of the subsequent compressed embedding. Second, we enhance CoLaR through reinforcement learning (RL), which leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) reason at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time simply by specifying the desired compression factor in the prompt. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared with the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR achieves performance gains of up to 5.4% while reducing latent reasoning chain length by 82.8%.
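The embedding-merging step described above can be illustrated with a minimal sketch. The abstract does not state which merge operator is used, so mean pooling is assumed here purely as a stand-in; the function names `sample_compression_factor` and `compress_embeddings`, and the toy embeddings, are hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence

Vector = List[float]


def sample_compression_factor(c_min: int, c_max: int, rng: random.Random) -> int:
    """Draw a compression factor uniformly from the inclusive range [c_min, c_max]."""
    return rng.randint(c_min, c_max)


def compress_embeddings(embeddings: Sequence[Sequence[float]], c: int) -> List[Vector]:
    """Merge every c consecutive token embeddings into one compressed embedding.

    ASSUMPTION: mean pooling is used as the merge operator for illustration
    only; the source text does not specify the actual operator.
    A trailing group shorter than c is merged on its own.
    """
    if c < 1:
        raise ValueError("compression factor must be >= 1")
    compressed: List[Vector] = []
    for start in range(0, len(embeddings), c):
        group = embeddings[start:start + c]
        dim = len(group[0])
        # Average each dimension across the tokens in this group.
        merged = [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        compressed.append(merged)
    return compressed


# Toy usage: seven 2-dimensional "token embeddings" (made-up values).
rng = random.Random(0)
tokens = [[float(i), float(2 * i)] for i in range(7)]
c = sample_compression_factor(1, 5, rng)  # hypothetical predefined range
latents = compress_embeddings(tokens, c)
```

Under this sketch, a chain of n token embeddings shrinks to ceil(n / c) compressed embeddings, which is the sense in which a larger compression factor yields a shorter, faster reasoning chain. Training the latent head to predict a distribution over the next compressed embedding, and the RL stage, are outside the scope of this example.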
