Even for simple arithmetic tasks like integer addition, Transformers struggle to generalize to sequences longer than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the task into the positional encoding of a (decoder-only) Transformer. Departing from the vanilla absolute position mechanism, which assigns a unique position ID to each token, we assign the same position ID to two or more "relevant" tokens; for the integer addition task, we regard digits of the same significance as being in the same position. On the empirical side, we show that with the proposed position coupling, a small (1-layer) Transformer trained on 1- to 30-digit additions can generalize up to 200-digit additions (6.67× the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks, such as addition with multiple summands, N×2 multiplication, copy/reverse, and a two-dimensional task.
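To make the coupling rule concrete, below is a minimal Python sketch of one way to assign coupled position IDs for an addition prompt laid out as "a+b=answer". The helper name `coupled_position_ids` and the convention of reserving ID 0 for the '+' and '=' tokens are illustrative assumptions, not the paper's exact scheme (which may differ in details such as padding and the handling of position offsets during training). Counting IDs from the least significant digit is what lets the same ID mark the same significance in both operands and the answer.

```python
# A minimal sketch of position coupling for the addition task, assuming a
# plain "a+b=answer" token layout. The ID convention (ID 0 for operator
# tokens, digit IDs counted from the least significant digit) is an
# illustrative assumption, not the paper's verbatim encoding.

def coupled_position_ids(a: str, b: str, answer: str) -> list[int]:
    """Return one position ID per token of the sequence  a '+' b '=' answer.

    Digits of the same significance share the same ID: the least
    significant digit of every number gets ID 1, the next gets ID 2,
    and so on, regardless of where the digit sits in the sequence.
    """
    def digit_ids(num: str) -> list[int]:
        # The leftmost digit is the most significant, so IDs descend to 1.
        return [len(num) - i for i in range(len(num))]

    return digit_ids(a) + [0] + digit_ids(b) + [0] + digit_ids(answer)


if __name__ == "__main__":
    tokens = list("653+49=702")
    ids = coupled_position_ids("653", "49", "702")
    for tok, pid in zip(tokens, ids):
        print(tok, pid)
    # '6'->3 '5'->2 '3'->1   '+'->0   '4'->2 '9'->1   '='->0   '7'->3 '0'->2 '2'->1
```

Note that the units digits '3', '9', and '2' all receive ID 1 even though they occupy different absolute positions in the sequence, which is exactly the "relevant tokens share a position" structure described above.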