Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Large Language Models that employ Chain-of-Thought reasoning achieve strong performance but consume excessive tokens, which inflates inference costs. Existing efficiency methods, such as explicit length penalties, difficulty estimators, and multi-stage curricula, either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement (BCR), a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: the model is trained to solve N problems simultaneously within a shared context window and is rewarded purely by per-instance accuracy. Because the N problems share one context window, this formulation creates an implicit token budget, which yields four key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than in baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both the 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy on five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, in which models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we demonstrate empirically that implicit budget constraints circumvent the adversarial gradients and catastrophic optimization collapse that arise with explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results show that BCR is practical and that simple structural incentives can unlock latent high-density reasoning in LLMs.
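To make the training setup concrete, the following is a minimal sketch of how N problems might be packed into one shared prompt and scored per instance. It is an illustration under stated assumptions, not the authors' implementation: the prompt template, the "Answer k:" output convention, exact string matching, and the function names build_batched_prompt, per_instance_rewards, and batch_reward are all hypothetical, and averaging the per-instance scores into a single scalar is only one possible aggregation.

```python
import re


def build_batched_prompt(problems):
    """Pack N problems into one prompt (hypothetical template, not from the paper)."""
    lines = [
        f"Solve the following {len(problems)} problems. "
        "End each solution with a line of the form 'Answer k: <value>'."
    ]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)


# Assumed output convention: one 'Answer k: value' line per problem.
ANSWER_RE = re.compile(r"^Answer (\d+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def per_instance_rewards(response, gold_answers):
    """Return a 0/1 accuracy reward per problem; unanswered problems score 0."""
    predicted = {}
    for index, value in ANSWER_RE.findall(response):
        predicted[int(index)] = value  # if an index repeats, the last line wins
    return [
        1.0 if predicted.get(k) == gold else 0.0
        for k, gold in enumerate(gold_answers, start=1)
    ]


def batch_reward(response, gold_answers):
    """Average the per-instance rewards (an assumed aggregation)."""
    rewards = per_instance_rewards(response, gold_answers)
    return sum(rewards) / len(rewards)
```

Note that nothing in this reward refers to response length. In this sketch, any pressure toward shorter solutions comes only from the fact that all N solutions must fit in one context window, which is the implicit token budget described above.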
