Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet these same models fail on surprisingly trivial problems. This raises the question: are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify Transformers, we investigate the limits of these models across three representative compositional tasks -- multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking a problem down into sub-steps and synthesizing those steps into a precise answer. We formulate compositional tasks as computation graphs, which lets us systematically quantify their complexity and break reasoning down into intermediate sub-procedures. Our empirical findings suggest that Transformers solve compositional tasks by reducing multi-step compositional reasoning to linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how Transformers' performance will rapidly decay with increased task complexity.
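To make the computation-graph formulation concrete, the following is a minimal sketch of how multi-digit multiplication might be unrolled into a graph of single-step operations. It is an illustration under stated assumptions, not the construction used in the study: the names `Node`, `multiplication_graph`, and `graph_depth` are hypothetical, the decomposition (digit-by-digit partial products followed by a chain of additions) is one of several possible choices, and node count and depth serve here only as illustrative proxies for task complexity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One step of the computation (hypothetical structure for illustration)."""
    op: str                                       # "input", "mul", or "add"
    value: int                                    # result computed at this step
    parents: list = field(default_factory=list)   # indices of the input nodes


def multiplication_graph(x: int, y: int):
    """Unroll schoolbook multiplication of two non-negative integers.

    Returns (nodes, index_of_output_node). Nodes are appended in
    topological order, so every parent index is smaller than its child's.
    """
    nodes = []

    def add_node(op, value, parents=()):
        nodes.append(Node(op, value, list(parents)))
        return len(nodes) - 1

    # Input nodes: one per digit, least significant digit first.
    x_digits = [add_node("input", int(c)) for c in reversed(str(x))]
    y_digits = [add_node("input", int(c)) for c in reversed(str(y))]

    # One "mul" node per digit pair, scaled by its place value.
    partials = []
    for j, b in enumerate(y_digits):
        for i, a in enumerate(x_digits):
            product = nodes[a].value * nodes[b].value * 10 ** (i + j)
            partials.append(add_node("mul", product, [a, b]))

    # Chain of "add" nodes that accumulates the partial products.
    total = partials[0]
    for p in partials[1:]:
        total = add_node("add", nodes[total].value + nodes[p].value, [total, p])
    return nodes, total


def graph_depth(nodes):
    """Length of the longest path from any input node to any node."""
    depths = []
    for node in nodes:
        if node.parents:
            depths.append(1 + max(depths[p] for p in node.parents))
        else:
            depths.append(0)
    return max(depths)


if __name__ == "__main__":
    for x, y in [(7, 8), (12, 34), (123, 456)]:
        nodes, out = multiplication_graph(x, y)
        print(x, y, nodes[out].value, len(nodes), graph_depth(nodes))
```

In this sketch both the number of nodes and the depth grow with the number of digits, which gives one simple way to see why longer multiplications demand more chained intermediate steps, each of which must be correct for the final answer to be correct.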