Length generalization, the ability to extrapolate from short training sequences to long test sequences, is a challenge for current large language models. While prior work has proposed changes to the architecture or data format to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing machine. This framework is both universal, since it can accommodate any algorithmic task, and simple, since each step requires only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication, and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we prove that transformers can implement Turing Programs by constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
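To make the format concrete, below is a minimal Python sketch of what a Turing-Program-style chain of thought for multi-digit addition could look like. The specific tape encoding (the `c=` carry marker, the `|` separator, and the function name `addition_turing_program`) is an illustrative assumption rather than the paper's exact format; the point is only that each step is a near-copy of the previous tape with one small local edit.

```python
def addition_turing_program(a: str, b: str) -> list[str]:
    """Emit a Turing-Program-style chain of thought for computing a + b.

    Each step copies the previous tape with one small local edit: the
    rightmost unprocessed digit pair is consumed, one output digit is
    appended, and a carry flag is rewritten -- mimicking how a Turing
    machine rewrites its tape one cell at a time.
    NOTE: this encoding is a hypothetical illustration, not the exact
    format used in the paper.
    """
    n = max(len(a), len(b))
    a, b = a.zfill(n), b.zfill(n)  # pad so every step is uniform

    steps, carry, result = [], 0, ""
    for i in range(n - 1, -1, -1):
        carry, digit = divmod(int(a[i]) + int(b[i]) + carry, 10)
        result = str(digit) + result
        # Tape after this step: remaining input digits, carry flag,
        # and the partial output produced so far.
        steps.append(f"{a[:i]}+{b[:i]}, c={carry} | {result}")
    if carry:  # write the final carry as its own step
        steps.append(f"+, c=0 | 1{result}")
    return steps


for step in addition_turing_program("457", "86"):
    print(step)
# 45+08, c=1 | 3
# 4+0, c=1 | 43
# +, c=0 | 543
```

Because every line differs from its predecessor in only a constant number of positions, the model's task at each step reduces to copying with small modifications, which is the property the abstract credits for robust length generalization.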