Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by spending additional compute at inference, without retraining. Chain-of-Thought (CoT) prompting and its extension, Long CoT, likewise improve accuracy by generating rich intermediate reasoning trajectories, but they incur substantial token costs that impede deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
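The three sampling axes can be illustrated with a minimal Python sketch. It is not taken from the released code: the function name `fractured_sampling`, its parameter names, the evenly spaced truncation points, and the two callables `sample_reasoning` and `sample_answer` (stand-ins for calls to an LLM) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def fractured_sampling(
    question: str,
    sample_reasoning: Callable[[str], List[str]],    # hypothetical: returns one reasoning trace as a list of steps
    sample_answer: Callable[[str, List[str]], str],  # hypothetical: returns a final answer given a (partial) trace
    num_trajectories: int,    # axis 1: number of reasoning trajectories
    answers_per_prefix: int,  # axis 2: final solutions drawn per truncated trajectory
    num_depths: int,          # axis 3: number of truncation depths per trajectory
) -> List[Tuple[int, int, str]]:
    """Return (trajectory index, depth index, answer) for every sampled solution."""
    results = []
    for t in range(num_trajectories):
        trace = sample_reasoning(question)
        for d in range(1, num_depths + 1):
            # Assumption: truncation points are evenly spaced; d == num_depths keeps the full trace.
            cut = len(trace) * d // num_depths
            prefix = trace[:cut]
            for _ in range(answers_per_prefix):
                results.append((t, d, sample_answer(question, prefix)))
    return results


# Toy stand-ins so the sketch runs without a model.
def toy_reasoning(question: str) -> List[str]:
    return [f"step{i}" for i in range(8)]


def toy_answer(question: str, prefix: List[str]) -> str:
    return f"answer-after-{len(prefix)}-steps"


samples = fractured_sampling("2+2?", toy_reasoning, toy_answer,
                             num_trajectories=2, answers_per_prefix=3, num_depths=4)
print(len(samples))  # 2 trajectories * 4 depths * 3 answers = 24
```

In this sketch, setting `num_depths=1` uses only the complete trace, which corresponds to ordinary full CoT sampling; larger values reuse each trajectory at several earlier cut points, so more candidate solutions are obtained without generating additional reasoning tokens.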