Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance across a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, than transformers without it at the same validation perplexity.
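The abstract describes selective attention only at a high level: a parameter-free mechanism that reduces attention to elements the model no longer needs. As one minimal sketch of that idea (not the paper's exact formulation), the snippet below assumes a hypothetical per-pair selection score: each token marks earlier tokens as no longer needed, the scores are accumulated over query positions, and the accumulated penalty is subtracted from the attention logits of all strictly later queries before the softmax. The function name and the `sel_logits` input are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_attention(q, k, v, sel_logits):
    """Causal attention where accumulated selection scores down-weight
    tokens that earlier queries marked as no longer needed.

    q, k, v: (T, d) arrays; sel_logits: (T, T) selection scores
    (a hypothetical input for this sketch; the paper derives its
    selection signal from quantities the model already computes)."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    logits = np.where(causal, logits, -np.inf)

    # Selection: token i's score for an earlier token j. ReLU keeps it
    # a pure "reduce attention" signal; a token cannot deselect itself,
    # so the diagonal is zeroed.
    s = np.maximum(np.where(causal, sel_logits, 0.0), 0.0)
    np.fill_diagonal(s, 0.0)

    # Accumulate selections over query positions, then shift by one row
    # so the penalty applies only to queries strictly after the
    # selecting token.
    f = np.cumsum(s, axis=0)
    f = np.vstack([np.zeros((1, T)), f[:-1]])

    # Subtracting the accumulated penalty before softmax reduces the
    # attention paid to deselected elements.
    return softmax(logits - f, axis=-1) @ v
```

Because the penalty is accumulated and only grows, a strongly deselected token's attention weight decays toward zero for all later queries, which is what makes it possible to evict such tokens from the context buffer and shrink the attention module's memory footprint.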