Transformer large language models (LLMs) have drawn admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet these same models fail on surprisingly trivial problems. This raises the question: are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify transformer LLMs, we investigate the limits of these models across three representative compositional tasks: multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how the performance of autoregressive generation can rapidly decay with increased task complexity.
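As a rough intuition for the claim that performance can decay rapidly with task complexity, consider a toy model that is not taken from the paper: assume each of the sub-steps in a reasoning chain is correct independently with the same probability. Under that simplifying assumption, the chance that the whole chain is correct shrinks geometrically with the number of steps. The function name `chain_success_probability` and the example accuracy values are hypothetical, chosen only for illustration.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a chain is correct.

    Assumption (ours, for illustration only): steps succeed independently
    and with identical probability, so the probabilities multiply.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 99%; print how the chance of a fully
    # correct answer changes as the number of required sub-steps grows.
    for steps in (1, 10, 50, 100):
        print(steps, chain_success_probability(0.99, steps))
```

Real models need not satisfy the independence assumption, so this is only a sketch of why small per-step error rates can matter for long compositions, not a restatement of the paper's theoretical argument.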