Parallel thinking enhances LLM reasoning by sampling and aggregating multiple reasoning paths. In system-level evaluations, a single global parallelism level N is allocated to all samples and is typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N' < N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, demonstrate its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse by using latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.
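To make the per-sample allocation concrete, the sketch below illustrates the general idea of estimating a parallelism level from a prompt's latent representation before decoding and then sampling only that many paths. The linear probe, the function names (`estimate_parallelism`, `parallel_think`, `sample_answer`), and the sigmoid mapping are illustrative assumptions, not the paper's actual T2 estimator.

```python
import numpy as np

def estimate_parallelism(latent: np.ndarray, weights: np.ndarray, bias: float,
                         n_max: int = 16) -> int:
    """Map a prompt's latent representation to a per-sample parallelism level N'.

    A linear probe with a sigmoid squash is used purely for illustration;
    T2's actual estimator is described in the paper, not reproduced here.
    """
    score = float(latent @ weights + bias)       # predicted difficulty score
    frac = 1.0 / (1.0 + np.exp(-score))          # squash to (0, 1)
    return max(1, int(round(frac * n_max)))      # harder prompts get more paths

def parallel_think(prompt: str, latent: np.ndarray, sample_answer,
                   weights: np.ndarray, bias: float, n_max: int = 16) -> str:
    """Sample N' reasoning paths for this prompt only, then majority-vote."""
    n_prime = estimate_parallelism(latent, weights, bias, n_max)
    answers = [sample_answer(prompt) for _ in range(n_prime)]
    return max(set(answers), key=answers.count)  # simple self-consistency vote
```

In this sketch, `sample_answer` stands in for one stochastic decoding pass of the underlying LLM; the point is only that the budget N' is chosen per sample from pre-decoding latents rather than fixed globally at N.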