Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the task into the positional encoding of a (decoder-only) Transformer. Departing from the vanilla absolute position mechanism that assigns a unique position ID to each token, we assign the same position ID to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as being in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1- to 30-digit additions can generalize up to 200-digit additions (6.67× the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as N×2 multiplication and a two-dimensional task.
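The coupling idea for addition can be sketched as follows: digits of equal significance across the two operands and the answer receive the same position ID. This is a minimal illustrative sketch, not the paper's exact implementation; the function name, the ID assigned to the `+` and `=` tokens, and the direction in which IDs grow are all assumptions made for illustration.

```python
def coupled_position_ids(a: str, b: str, c: str, start: int = 1):
    """Assign position IDs to the tokens of the query 'a+b=c' so that
    digits of the same significance (ones, tens, ...) share one ID.

    Illustrative scheme (an assumption, not the paper's exact design):
    IDs increase toward the least-significant digit, and the operator
    tokens '+' and '=' both receive the start ID.
    """
    width = max(len(a), len(b), len(c))  # widest number sets the ID range
    tokens, ids = [], []
    for s in (a, "+", b, "=", c):
        if s in ("+", "="):
            tokens.append(s)
            ids.append(start)  # one arbitrary choice for operator tokens
            continue
        for i, d in enumerate(s):
            significance = len(s) - 1 - i  # 0 = ones digit
            tokens.append(d)
            # Equal significance maps to an equal ID across a, b, and c.
            ids.append(start + width - significance)
    return tokens, ids
```

For example, in `653 + 49 = 702`, the ones digits `3`, `9`, and `2` all receive the same position ID, as do the tens digits `5`, `4`, and `0`, so the model can attend to aligned digit columns regardless of sequence length.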