Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals such as reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, only a 0.6% drop from the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress.
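To make the two-stage procedure described above concrete, the following is a minimal Python sketch of the chunk-level pipeline: segment a Long-CoT trace into chunks, compress each chunk with an LLM, and keep the shortest candidate that remains coherent with the already-compressed prefix. All helper names (segment_into_chunks, compress_long_cot) and the llm.compress / llm.is_coherent calls are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
from typing import List

def segment_into_chunks(long_cot: str, max_chunk_words: int = 512) -> List[str]:
    """Split a Long-CoT trace into manageable chunks at paragraph boundaries."""
    paragraphs = [p for p in long_cot.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len((current + p).split()) > max_chunk_words:
            chunks.append(current.strip())
            current = ""
        current += p + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def compress_long_cot(long_cot: str, llm, num_candidates: int = 4) -> str:
    """Stage 1: LLM-driven inner-chunk compression (sample several rewrites).
    Stage 2: inter-chunk search that prefers the shortest candidate still
    coherent with the compressed prefix (hypothetical llm interface)."""
    compressed_prefix = ""
    for chunk in segment_into_chunks(long_cot):
        # Sample several compressed rewrites of this chunk.
        candidates = [llm.compress(chunk) for _ in range(num_candidates)]
        # Prefer short candidates, but require coherence with the prefix.
        candidates.sort(key=len)
        chosen = next(
            (c for c in candidates if llm.is_coherent(compressed_prefix, c)),
            candidates[0],  # fall back to the shortest if none pass the check
        )
        compressed_prefix += chosen + "\n\n"
    return compressed_prefix.strip()
```

The key design choice in this sketch is that the inter-chunk selection conditions on the compressed prefix, which is how coherence across chunk boundaries is maintained while still favoring shorter rewrites.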