Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. Moreover, simply training on longer sequences is inefficient due to the quadratic computational complexity of the global attention mechanism. In this work, we demonstrate that this failure mode is linked to positional encodings being out-of-distribution for longer sequences (even for relative encodings) and introduce a novel family of positional encodings that can overcome this problem. Concretely, our randomized positional encoding scheme simulates the positions of longer sequences and randomly selects an ordered subset to fit the sequence's length. Our large-scale empirical evaluation of 6000 models across 15 algorithmic reasoning tasks shows that our method allows Transformers to generalize to sequences of unseen length (increasing test accuracy by 12.0% on average).
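The position-sampling step described above can be illustrated with a minimal sketch. It assumes positions are drawn uniformly without replacement from a range larger than any training sequence; the abstract does not state the sampling distribution, and the names `sample_randomized_positions` and `max_len` are hypothetical, chosen only for this illustration.

```python
import random


def sample_randomized_positions(seq_len, max_len, rng=None):
    """Return seq_len distinct positions from range(max_len), in increasing order.

    Assumption: uniform sampling without replacement. max_len stands in for
    the longest sequence length the positions are meant to simulate.
    """
    if seq_len > max_len:
        raise ValueError("seq_len must not exceed max_len")
    if rng is None:
        rng = random.Random()
    # Sample a subset of the simulated longer range, then sort it so the
    # relative order of tokens in the sequence is preserved.
    return sorted(rng.sample(range(max_len), seq_len))


# Example: a 5-token sequence receives 5 ordered positions drawn from 0..63.
positions = sample_randomized_positions(seq_len=5, max_len=64, rng=random.Random(0))
print(positions)
```

The sampled indices would then replace the usual `0, 1, ..., seq_len - 1` as input to whichever positional encoding the model uses. Under this assumption, the model sees large position values during training even when the sequences themselves are short.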