Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals such as reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, only a 0.6% drop from the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress.
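To make the two-stage procedure described above concrete, the following is a minimal Python sketch of the chunk-level pipeline: segment a Long-CoT trace into chunks, compress each chunk with an LLM, and keep the shortest candidate that remains coherent with the already-compressed prefix. All helper names (segment_into_chunks, compress_long_cot) and the llm.compress / llm.is_coherent calls are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
from typing import List

def segment_into_chunks(long_cot: str, max_chunk_words: int = 512) -> List[str]:
    """Split a Long-CoT trace into manageable chunks at paragraph boundaries."""
    paragraphs = [p for p in long_cot.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len((current + p).split()) > max_chunk_words:
            chunks.append(current.strip())
            current = ""
        current += p + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def compress_long_cot(long_cot: str, llm, num_candidates: int = 4) -> str:
    """Stage 1: LLM-driven inner-chunk compression (sample several rewrites).
    Stage 2: inter-chunk search that prefers the shortest candidate still
    coherent with the compressed prefix (hypothetical llm interface)."""
    compressed_prefix = ""
    for chunk in segment_into_chunks(long_cot):
        # Sample several compressed rewrites of this chunk.
        candidates = [llm.compress(chunk) for _ in range(num_candidates)]
        # Prefer short candidates, but require coherence with the prefix.
        candidates.sort(key=len)
        chosen = next(
            (c for c in candidates if llm.is_coherent(compressed_prefix, c)),
            candidates[0],  # fall back to the shortest if none pass the check
        )
        compressed_prefix += chosen + "\n\n"
    return compressed_prefix.strip()
```

The key design choice in this sketch is that the inter-chunk selection conditions on the compressed prefix, which is how coherence across chunk boundaries is maintained while still favoring shorter rewrites.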