Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of the KV cache and the quadratic complexity of attention. In this paper, we introduce Accordion-Thinking, an end-to-end framework in which LLMs learn to self-regulate the granularity of their reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, in which the model periodically summarizes its thought process and discards earlier thoughts to reduce its dependency on historical tokens. We further apply reinforcement learning to incentivize this capability, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon indicates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Accordion-Thinking shows that, with learned self-compression, LLMs can tackle complex reasoning tasks while depending on only a minimal number of historical tokens, without compromising solution quality. It achieves three times the throughput while maintaining accuracy on a 48GB GPU memory configuration, and its structured step summaries provide a human-readable account of the reasoning process.
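To make the difference between the Fold and Unfold modes concrete, the following is a minimal sketch and not the paper's implementation. It assumes that reasoning proceeds in discrete steps, that each step ends with a summary, and that Fold mode replaces a finished step's detailed thought with that summary before the next step begins. The names `Step` and `peak_context` are hypothetical, and plain lists of strings stand in for model tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step (hypothetical structure, for illustration only)."""
    thought: List[str]   # detailed reasoning tokens produced in this step
    summary: List[str]   # compact summary tokens emitted at the end of the step


def peak_context(prompt: List[str], steps: List[Step], fold: bool) -> int:
    """Return the largest number of tokens held in context during generation.

    With fold=False (Unfold), every thought and summary stays in context.
    With fold=True (Fold), a finished step's thought is discarded and only
    its summary is carried forward.
    """
    context = list(prompt)
    peak = len(context)
    for step in steps:
        start = len(context)
        # While a step is being generated, its thought and summary are both live.
        context.extend(step.thought)
        context.extend(step.summary)
        peak = max(peak, len(context))
        if fold:
            # Fold: replace the whole step with its summary.
            context[start:] = step.summary
    return peak


if __name__ == "__main__":
    # Synthetic sizes chosen for illustration; they are not taken from the paper.
    prompt = ["p"] * 50
    steps = [Step(thought=["t"] * 200, summary=["s"] * 20) for _ in range(5)]
    print("Unfold peak context:", peak_context(prompt, steps, fold=False))
    print("Fold peak context:  ", peak_context(prompt, steps, fold=True))
```

In this toy setting the Unfold context grows with the total number of reasoning tokens, whereas the Fold context is bounded by the prompt, the accumulated summaries, and the single step currently being generated. The sketch only counts tokens; it says nothing about accuracy, which the abstract attributes to reinforcement learning.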