Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated by modifying prompts to include examples with chains of thought--demonstrations of solution procedures--on the intuition that an LLM can be taught, in context, an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs along two axes: the generality of the examples given in the prompt, and the complexity of the problems queried with each prompt. Although our problems are very simple, we find meaningful performance improvements from chain-of-thought prompts only when those prompts are exceedingly specific to their problem class, and those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of the stacks shown in the examples. Our results suggest that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations, but instead depend on carefully engineered, highly problem-specific prompts. This spotlights a drawback of chain of thought: a sharp tradeoff between possible performance gains and the amount of human labor required to generate examples with correct reasoning traces.