Adaptive Nucleus Truncation for Long-Form Reasoning

Sampling plays an important role in long-form language-model reasoning. Over thousands of decoding steps, small changes in the candidate token set can compound into different reasoning trajectories, stability profiles, and final answers. Existing truncation methods such as top-$p$, min-$p$, and fixed top-$n\sigma$ sampling improve over unrestricted sampling, but they rely on fixed thresholds that cannot adapt to changes in entropy, task difficulty, training stage, or generation budget. We introduce Adaptive Nucleus Truncation Sampling (ANTS), which extends top-$n\sigma$ sampling from a fixed decoding rule into an adaptive rollout-control mechanism for long-form generation. ANTS selects a standardized neighborhood around the maximum logit before temperature scaling, adapts the truncation width with an entropy-conditioned controller, and retains a no-truncation fallback arm to stabilize training when truncation becomes unsafe. On a 33B-total / 4B-active sparse Mixture-of-Experts reasoning model, ANTS raises the average score across percentage-scored benchmarks by +1.9, +3.8, and +5.2 points at 8K, 16K, and 32K generation budgets, respectively. The strongest gains appear on instruction following and mathematical reasoning, with IFBench improving by more than 10 points at 32K and AIME 2025 improving by 7 points. Code generation reveals an important budget interaction: on Codeforces, ANTS trails the baseline at 8K, but reverses this gap and substantially improves Elo at 16K and 32K. These results suggest that sampler design should be treated not just as a decoding hyperparameter, but as part of how we stabilize and scale long-budget reasoning.
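
To make the three ingredients named above concrete, the following is a minimal, illustrative sketch rather than the authors' implementation. It keeps the tokens whose raw logits lie within $n$ standard deviations of the maximum logit (the top-$n\sigma$ rule), applies temperature only after that selection, and falls back to the full vocabulary when a safety check fails. The abstract does not specify the controller or the fallback trigger, so the linear entropy-to-width map in `adaptive_width`, the bounds `n_min` and `n_max`, and the `min_kept_mass` test are all assumptions introduced here for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_n_sigma_keep(logits, n):
    """Indices whose raw logit is within n standard deviations of the max logit."""
    mean = sum(logits) / len(logits)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    cutoff = max(logits) - n * sigma
    return [i for i, x in enumerate(logits) if x >= cutoff]


def adaptive_width(logits, n_min=0.5, n_max=2.0):
    """ASSUMED controller: widen n linearly with normalized entropy.

    The source only says the width is entropy-conditioned; this linear
    map and its bounds are hypothetical.
    """
    if len(logits) < 2:
        return n_min
    h = entropy(softmax(logits)) / math.log(len(logits))  # in [0, 1]
    return n_min + (n_max - n_min) * h


def ants_sample(logits, temperature=1.0, min_kept_mass=0.05, rng=random):
    """Sample one token index; returns (token, kept_indices, n)."""
    n = adaptive_width(logits)
    keep = top_n_sigma_keep(logits, n)  # selection uses pre-temperature logits
    full = softmax(logits)
    # ASSUMED fallback trigger: if the kept set carries too little
    # probability mass, disable truncation for this step.
    if sum(full[i] for i in keep) < min_kept_mass:
        keep = list(range(len(logits)))
    probs = softmax([logits[i] for i in keep], temperature)
    token = rng.choices(keep, weights=probs, k=1)[0]
    return token, keep, n
```

For example, `ants_sample([10.0, 9.5, 2.0, 1.0, 0.0], temperature=0.8)` returns a sampled index together with the kept candidate set and the width that produced it, which makes it easy to log how often the fallback path is taken. A real controller would presumably also condition on training stage or generation budget, which this sketch does not attempt.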
