Test-time scaling has shown that allocating additional computation at inference can improve generation quality, motivating a natural follow-up question: where should this computation be spent? Building on this insight, we introduce PonderLM-3, a pretraining framework built on the PonderLM-2 backbone that learns token-wise adaptive pondering, selectively allocating additional computation under purely self-supervised objectives. This turns additional inference computation into an allocatable per-token resource: tokens receive extra computation only when it is beneficial, rather than all tokens paying a uniform extra cost. To make this allocation learnable while maintaining train-inference consistency, PonderLM-3 injects a differentiable attention mask during pretraining and pairs it with a matching hard pruning rule at inference. PonderLM-3 establishes a stronger Pareto frontier: compared with existing recursive or adaptive baselines, it achieves lower pretraining perplexity at equal inference FLOPs. On downstream benchmarks, PonderLM-3 attains performance comparable to fixed-step PonderLM-2 under the same maximum number of additional computation steps, while using fewer inference FLOPs in practice. Overall, PonderLM-3 provides an end-to-end differentiable and train-inference consistent framework for token-wise adaptive computation, enabling additional inference compute to be allocated where it is most useful rather than paid uniformly by every token.
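The core train-inference pairing can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard sigmoid gate over hypothetical per-token "ponder" logits, where the soft gate serves as the differentiable mask during pretraining and a threshold on the same gate serves as the matching hard pruning rule at inference. All names (`ponder_gate`, `hard_prune`, `scores`) are illustrative.

```python
import numpy as np

def ponder_gate(scores, tau=1.0):
    """Soft per-token gate in (0, 1): a differentiable mask suitable for
    scaling an extra computation step's contribution during pretraining."""
    return 1.0 / (1.0 + np.exp(-scores / tau))

def hard_prune(scores, threshold=0.5):
    """Inference-time rule matching the soft gate: a token receives an
    extra pondering step only where its gate value exceeds the threshold."""
    return ponder_gate(scores) > threshold

# Hypothetical per-token "need more compute" logits for an 8-token sequence.
rng = np.random.default_rng(0)
scores = rng.normal(size=8)

soft = ponder_gate(scores)        # used in training: gradients flow through it
extra_steps = hard_prune(scores)  # used at inference: boolean per-token decision

print(f"{extra_steps.sum()} of {len(scores)} tokens receive extra computation")
```

Because the same gate underlies both branches, shrinking `tau` drives the soft mask toward the hard decision, which is one common way such frameworks keep training and inference consistent.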