The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors such as QwQ, DeepSeek-R1 (R1), and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study finds that longer chains of thought (CoTs) from these o1-like models do not consistently improve accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows that this phenomenon is closely related to the models' self-revision capabilities: longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1, and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared with conventional majority voting approaches.
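To make the idea concrete, a minimal sketch of a length-aware majority vote follows, assuming the selection rule is to pick the most-voted answer among parallel samples and to break ties by preferring answers whose CoTs are shorter on average; the function name shortest_majority_vote and the (answer, cot_length) input format are illustrative assumptions, not the paper's exact formulation.

from collections import defaultdict

def shortest_majority_vote(samples):
    # Hypothetical sketch: `samples` is a list of (answer, cot_length) pairs
    # produced by parallel decoding of the same question.
    votes = defaultdict(list)
    for answer, cot_length in samples:
        votes[answer].append(cot_length)
    # Rank candidates by vote count (descending), then mean CoT length (ascending),
    # so that among equally voted answers the one with shorter CoTs wins.
    def rank_key(item):
        answer, lengths = item
        return (-len(lengths), sum(lengths) / len(lengths))
    best_answer, _ = min(votes.items(), key=rank_key)
    return best_answer

# Usage example with made-up token counts:
# shortest_majority_vote([("42", 800), ("42", 1200), ("41", 500)]) returns "42".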