A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper we point out that traditional evaluations, which focus solely on performance metrics, overlook a key factor: the added effectiveness that comes from additional compute. By neglecting this aspect, they often present a skewed view of strategy efficiency. This paper introduces a framework that incorporates the compute budget into the evaluation, yielding a more informative comparison that accounts for both performance and computational cost. Under this budget-aware perspective, we find that complex reasoning strategies often surpass simpler baselines not because of algorithmic ingenuity but because they are allocated greater computational resources. When a simple baseline such as chain-of-thought self-consistency is given comparable compute, it frequently outperforms reasoning strategies proposed in the literature. Under the same budget-aware perspective, we also find that, unlike self-consistency, certain strategies such as multi-agent debate or Reflexion can degrade as more compute budget is utilized.
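As a hypothetical illustration (the function names and the toy sampler below are our own, not from the paper), the self-consistency baseline under a fixed compute budget can be sketched as charging each model call against the budget and majority-voting over the sampled final answers:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[str], str],
                     question: str,
                     budget: int) -> str:
    """Chain-of-thought self-consistency: spend the entire compute
    budget (measured here in model calls) on independent samples,
    then return the majority-vote answer."""
    answers = [sample_answer(question) for _ in range(budget)]
    # Most common final answer wins; ties break by first occurrence.
    return Counter(answers).most_common(1)[0][0]

# Deterministic toy sampler standing in for an LLM call.
_samples = iter(["42", "7", "42", "42", "7"])
print(self_consistency(lambda q: next(_samples), "What is 6*7?", budget=5))
```

Counting model calls as the budget makes the comparison in the paper concrete: any strategy evaluated against this baseline should be granted the same number of calls.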