Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens that prompt self-evaluative reflection. These transition markers and reflective cues are referred to as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, which aims to improve the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a training-free decoding strategy that dynamically modulates reflection token logits with a position-dependent triangular waveform, incurring no additional computation cost. Experiments on MATH500, AIME2024/2025, AMC2023, GPQA Diamond, and LiveCodeBench demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-14B), outperforming standard decoding and recent approaches such as TIP (thought switching penalty) and S1. Code is available at https://github.com/OPTML-Group/CyclicReflex.
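To make the core idea concrete, the sketch below illustrates one plausible form of the cyclical schedule described above: a triangular waveform, as a function of decoding position, that is added to the logits of reflection tokens so their sampling probability rises and falls periodically. The function names, the cycle parameterization (`period`, `amplitude`), and the token-id handling are all illustrative assumptions, not the paper's exact implementation; see the linked repository for the authors' code.

```python
def triangular_bias(step: int, period: int, amplitude: float) -> float:
    """Position-dependent triangular waveform oscillating in [-amplitude, +amplitude].

    Illustrative assumption: the bias starts at -amplitude, rises linearly to
    +amplitude at the midpoint of each cycle, then falls back, repeating every
    `period` decoding steps.
    """
    phase = (step % period) / period          # position within the cycle, in [0, 1)
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))


def modulate_reflection_logits(logits: list[float],
                               reflection_token_ids: list[int],
                               step: int,
                               period: int = 100,
                               amplitude: float = 3.0) -> list[float]:
    """Add the cyclical bias to reflection-token logits only (hypothetical helper).

    A positive bias encourages emitting reflection tokens ("wait", "but", ...);
    a negative bias suppresses them, trading off over- vs. under-reflection.
    """
    bias = triangular_bias(step, period, amplitude)
    out = list(logits)
    for tid in reflection_token_ids:
        out[tid] += bias
    return out
```

At the start and end of each cycle the bias is maximally negative (discouraging reflection), while mid-cycle it is maximally positive, mirroring how cyclical learning-rate schedules alternate between exploration and consolidation during optimization.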