Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., to generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems, the first exact characterization of a type of transformer in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
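To make the linear-step claim concrete, the following is a minimal sketch of why a scratchpad of one token per input symbol suffices to simulate a finite-state machine. It is plain Python, not a transformer and not the paper's construction: the parity automaton and the names `TRANSITIONS`, `START`, `ACCEPTING`, and `decode_with_scratchpad` are hypothetical, chosen only for illustration. The point it shows is that each step needs only the previously emitted token and one input symbol.

```python
from typing import Dict, List, Tuple

# Hypothetical example automaton (an assumption, not from the paper):
# accepts binary strings containing an even number of 1s.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"
ACCEPTING = {"even"}


def decode_with_scratchpad(inp: str) -> Tuple[List[str], bool]:
    """Emit one intermediate token (the current state) per input symbol.

    Each step reads only the last emitted token and one input symbol,
    so the scratchpad holds exactly len(inp) tokens: a linear budget.
    """
    scratchpad: List[str] = []
    state = START
    for symbol in inp:
        # One "decoding step": condition on the previous state token.
        state = TRANSITIONS[(state, symbol)]
        scratchpad.append(state)
    # The final answer is read off the last state.
    return scratchpad, state in ACCEPTING


if __name__ == "__main__":
    steps, accepted = decode_with_scratchpad("1101")
    print(steps, accepted)
```

For the input `"1101"` the scratchpad is `['odd', 'even', 'even', 'odd']` and the string is rejected, since it contains three 1s. Without the scratchpad, the answer would have to be produced in a single step after reading the input, which is the setting the abstract describes as provably limited.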