Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., to generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the size of the increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and that polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems, the first exact characterization of a class of transformers in terms of standard complexity classes. Together, these results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
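To make the linear-steps claim concrete, the following is a minimal conceptual sketch, not the paper's construction: if a decoder emits one intermediate "state token" per input symbol, a scratchpad of length n suffices to simulate a finite-state machine on an input of length n. The example automaton and the names `DELTA`, `ACCEPT`, and `decode_with_scratchpad` are hypothetical choices made for this illustration.

```python
# Illustrative sketch (an assumption-laden toy, not the paper's construction):
# a chain of thought with one intermediate "state token" per input symbol
# suffices to simulate a deterministic finite automaton (DFA), i.e., to
# recognize a regular language with a linear number of decoding steps.

# Example DFA over {"a", "b"}: accepts strings with an even number of "b"s.
START, ACCEPT = "even", {"even"}
DELTA = {
    ("even", "a"): "even", ("even", "b"): "odd",
    ("odd", "a"): "odd",   ("odd", "b"): "even",
}

def decode_with_scratchpad(inp: str) -> bool:
    """Simulate the DFA by autoregressively emitting one state token per step.

    Each step needs only the previous scratchpad token and the next input
    symbol; such a local update is easy for a transformer decoder to realize.
    """
    scratchpad = [START]                 # intermediate tokens = DFA states
    for symbol in inp:                   # n input symbols -> n decoding steps
        scratchpad.append(DELTA[(scratchpad[-1], symbol)])
    return scratchpad[-1] in ACCEPT      # final token determines the answer

assert decode_with_scratchpad("abba") is True   # two "b"s: even, accepted
assert decode_with_scratchpad("ab") is False    # one "b": odd, rejected
```

The point of the sketch is that each decoding step is a constant-size computation over the most recent scratchpad token and one input symbol; this locality is why a linear number of intermediate tokens is enough to recognize any regular language, in contrast to answering immediately.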