We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training. We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums. However, this method fails for multiplication, so we propose train set priming: adding a few ($10$ to $50$) long sequences to the training set. We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35$-digit $\times$ $3$-digit examples. We also show that models can be primed for different generalization lengths, and that the priming sample size scales as the logarithm of the training set size. Finally, we discuss potential applications of priming beyond arithmetic.
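To make the idea of train set priming concrete, the following is a minimal sketch of how such a training set might be assembled for the multiplication task. It is an illustration under stated assumptions, not the procedure used in the experiments: the function names (`sample_operand`, `make_example`, `build_primed_train_set`), the uniform sampling of operands, the use of operands with exactly the stated number of digits, and the choice to place every priming example at the target length are all hypothetical.

```python
import random


def sample_operand(n_digits, rng):
    """Draw a uniformly random integer with exactly n_digits digits (assumed sampling scheme)."""
    low = 10 ** (n_digits - 1)
    high = 10 ** n_digits - 1
    return rng.randint(low, high)


def make_example(len_a, len_b, rng):
    """Return one multiplication example as (a, b, a * b)."""
    a = sample_operand(len_a, rng)
    b = sample_operand(len_b, rng)
    return (a, b, a * b)


def build_primed_train_set(n_train, n_priming, train_len=5, long_len=35,
                           other_len=3, seed=0):
    """Mix a large set of short examples with a few long 'priming' examples.

    Assumption: all priming examples use the target length long_len.
    """
    rng = random.Random(seed)
    # Bulk of the data: train_len-digit x other_len-digit products.
    base = [make_example(train_len, other_len, rng) for _ in range(n_train)]
    # Priming examples: a handful of long_len-digit x other_len-digit products.
    priming = [make_example(long_len, other_len, rng) for _ in range(n_priming)]
    data = base + priming
    rng.shuffle(data)
    return data
```

For instance, `build_primed_train_set(n_train=1000, n_priming=50)` yields 1050 examples, of which 50 have a $35$-digit first operand. Python integers have arbitrary precision, so the long products are computed exactly.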