Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur a substantial memory footprint and computational overhead, becoming a bottleneck for LRM efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncovers an interesting phenomenon: only a small set of decision-critical tokens in a reasoning trace steers the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens during inference and retains only their associated Key-Value (KV) cache states, evicting the remaining redundant entries to optimize efficiency.
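The abstract does not specify how decision-critical tokens are scored, so the following is only a minimal sketch of the general idea of attention-based KV cache eviction: score each cached token by the aggregate attention it receives from later positions and keep the top fraction. The function names, the sum-over-queries scoring rule, and the `keep_ratio` parameter are illustrative assumptions, not the paper's actual DynTS algorithm.

```python
import numpy as np

def select_critical_tokens(attn, keep_ratio=0.25):
    """Score each cached token by the total attention it receives
    from later query positions, then keep the top fraction.

    attn: (num_queries, num_keys) attention-weight matrix.
    Returns sorted indices of the retained key tokens.
    NOTE: this top-k heuristic is an assumption for illustration.
    """
    scores = attn.sum(axis=0)                 # aggregate attention per key token
    k = max(1, int(attn.shape[1] * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top-k tokens, in original order
    return keep

def evict_kv_cache(keys, values, keep_idx):
    """Retain only the KV entries for the selected tokens."""
    return keys[keep_idx], values[keep_idx]

# Toy example: three query positions attend over four cached tokens;
# token 1 receives most of the attention mass, so it is retained.
attn = np.array([[0.10, 0.70, 0.10, 0.10],
                 [0.05, 0.80, 0.10, 0.05],
                 [0.10, 0.60, 0.20, 0.10]])
keep = select_critical_tokens(attn, keep_ratio=0.25)
keys = np.arange(8, dtype=float).reshape(4, 2)
values = np.arange(8, dtype=float).reshape(4, 2)
kept_keys, kept_values = evict_kv_cache(keys, values, keep)
```

Evicting by aggregate received attention is a common proxy for token importance; the actual selection criterion in DynTS may differ.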