Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize across lengths on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text: for example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, such symmetries are quite unnatural for text. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that incorporating it via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular the complexity of the underlying task, and propose changes in the training distribution to address them.
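To make the two ideas in the abstract concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation, and all function names are hypothetical): numbers are reformatted so the least-significant digit comes first, and each digit receives a position ID equal to its significance, so that digits at the same position across different operands share an index.

```python
# Illustrative sketch, assuming reversed-digit formatting and significance-based
# position IDs; this is one plausible reading of "modified number formatting and
# custom positional encodings", not the authors' exact scheme.

def format_number(n: int) -> list[str]:
    """Emit digits least-significant first, e.g. 512 -> ['2', '1', '5']."""
    return list(str(n))[::-1]

def encode_addition(a: int, b: int) -> tuple[list[str], list[int]]:
    """Tokenize 'a + b =' with per-digit position IDs tied to digit significance.

    Digits of equal significance in `a` and `b` receive the same position ID,
    making the column-wise structure of addition explicit to the model.
    """
    tokens, positions = [], []
    for num, sep in ((a, "+"), (b, "=")):
        for i, d in enumerate(format_number(num)):
            tokens.append(d)
            positions.append(i)      # position ID = digit significance
        tokens.append(sep)
        positions.append(-1)         # separators get a sentinel position
    return tokens, positions

if __name__ == "__main__":
    toks, pos = encode_addition(512, 49)
    print(toks)  # ['2', '1', '5', '+', '9', '4', '=']
    print(pos)   # [0, 1, 2, -1, 0, 1, -1]
```

Because the position IDs depend only on digit significance rather than absolute sequence position, the same encoding applies unchanged to operands far longer than those seen in training.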