There has been growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose using fixed positional encodings to determine attention weights (referred to as positional attention), which enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.
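To make the mechanism concrete, the following is a minimal sketch of a single positional-attention head in NumPy. All names (`positional_attention`, `P`, `Wq`, `Wk`, `Wv`) are illustrative assumptions, not the paper's actual implementation: the key point is that the attention weights are computed from fixed positional encodings `P` rather than from the input `X`, so the attention pattern is identical for every input and cannot depend on the token values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional_attention(X, P, Wq, Wk, Wv):
    """One hypothetical positional-attention head.

    X  : (n, d) input token values
    P  : (n, d) fixed positional encodings
    Wq, Wk, Wv : (d, d) learned projection matrices
    """
    # Standard self-attention would compute queries and keys from X;
    # here they come from the fixed positional encodings P instead.
    scores = (P @ Wq) @ (P @ Wk).T / np.sqrt(Wk.shape[1])
    A = softmax(scores, axis=-1)  # (n, n) weights, identical for all inputs
    # The values are still computed from the input, so information
    # from X is mixed according to the input-independent pattern A.
    return A @ (X @ Wv)
```

Because `A` does not depend on `X`, the whole head is a fixed linear map applied to the token values, which is one intuition for why the routing pattern generalizes to value ranges never seen in training.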