Chain-of-Thought (CoT) has been shown empirically to improve Transformers' performance and, theoretically, to increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. Using recent theoretical frameworks for Transformer length generalization, we find that, under standard positional encodings and a finite alphabet, Transformers with CoT cannot solve problems beyond $TC^0$; that is, the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we obtain a length-generalizable simulation of Turing machines in which the CoT trace length is linear in the simulated runtime, up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token and log only value changes, so that the current tape symbol can be recovered through counts, circumventing both barriers. Further, we show empirically that signpost tokens and value-change encodings provide actionable guidance for improving length generalization on hard problems.
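To make the signpost and value-change idea concrete, the following is a minimal sketch and not the paper's construction. It assumes a binary tape alphabet with every cell initially 0, so that each logged change is a bit flip and the current symbol at a position is the parity of that position's signpost count. The names `simulate_with_change_log`, `read_cell`, and the `<p{i}>` token format are hypothetical, chosen only for illustration.

```python
def simulate_with_change_log(writes, tape_len):
    """Apply (position, bit) writes to an all-zero binary tape.

    Returns the change log and the final tape. A signpost token is
    appended to the log only when a write actually changes the cell.
    Assumption: binary alphabet, so every logged change is a flip.
    """
    tape = [0] * tape_len
    log = []
    for pos, bit in writes:
        if tape[pos] != bit:
            log.append(f"<p{pos}>")  # hypothetical signpost token for cell pos
            tape[pos] = bit
    return log, tape


def read_cell(log, pos):
    """Recover the current bit at pos by counting its signpost tokens.

    No copying of the tape and no search for the last write is needed:
    an odd number of flips means 1, an even number means 0.
    """
    return log.count(f"<p{pos}>") % 2


writes = [(0, 1), (2, 1), (0, 0), (0, 0), (1, 1), (0, 1), (2, 0)]
log, tape = simulate_with_change_log(writes, tape_len=3)
recovered = [read_cell(log, i) for i in range(3)]
```

Here the redundant write `(0, 0)` adds nothing to the log, and `recovered` matches `tape` cell by cell. A larger alphabet would need a different counting scheme than parity, which this sketch does not attempt.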