The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, these models often fall into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility-maximization problem in which tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks it reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
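To make the gating mechanism concrete, the following is a minimal sketch of a difficulty-gated, length-shaped reward in the spirit described above. The abstract does not give CODA's exact formulas, so every name and functional form here (`group_accuracy` as the difficulty proxy, the 0.5 threshold, the linear gates, `target_len`, `alpha`, `beta`) is a hypothetical illustration, not the paper's actual implementation.

```python
def coda_shaped_reward(correct: bool, length: int, group_accuracy: float,
                       target_len: float = 1024.0,
                       alpha: float = 1.0, beta: float = 1.0) -> float:
    """Illustrative shaped reward: binary base reward plus a
    difficulty-gated, length-dependent shaping term.

    correct: whether this rollout's final answer is correct.
    length: token count of the rollout.
    group_accuracy: fraction of correct rollouts in the instance's
        group (a policy-internal difficulty proxy: high accuracy
        suggests an easy instance, low accuracy a hard one).
    All other parameters are assumed constants for this sketch.
    """
    base = 1.0 if correct else 0.0

    # Two non-negative gates derived from the difficulty signal:
    # the easy-side gate activates when the group mostly succeeds,
    # the hard-side gate when it mostly fails.
    easy_gate = alpha * max(0.0, group_accuracy - 0.5)
    hard_gate = beta * max(0.0, 0.5 - group_accuracy)

    # Length-dependent shaping term, normalized by a reference length:
    # penalize verbosity on easy instances, reward longer deliberation
    # on hard ones.
    rel_len = length / target_len
    shaping = (hard_gate - easy_gate) * rel_len
    return base + shaping
```

Under this toy formulation, a correct but verbose rollout on an easy instance scores lower than a concise one, while on a hard instance the sign of the shaping term flips and longer rollouts are encouraged; only one gate is nonzero for any given instance.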