Transformers have had tremendous impact on a range of sequence-modeling tasks, largely due to their ability to retrieve information from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This weak gradient signal can lead to inefficient learning of the parameters preceding the attention operation. To address this, we introduce a new attention mechanism called LASER, which we analytically show admits a larger gradient signal. We show that LASER attention can be implemented with small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters, where LASER yields up to a 3.38% and an average of roughly 1% improvement over standard attention on downstream evaluations. Using LASER gives the following relative improvements in generalization performance across a variety of tasks (vision, text, and speech): 4.67% in accuracy for a Vision Transformer (ViT) on ImageNet, 2.25% in error rate for a Conformer on the Librispeech speech-to-text task, and 0.93% in the fraction of incorrect predictions for a 2.2-billion-parameter BERT.
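The small-gradient observation can be illustrated numerically: the Jacobian of softmax at probabilities p is diag(p) - p p^T, whose entries shrink toward zero as the distribution saturates on one entry. The sketch below is our own illustration of this general fact, not the paper's analysis; the helper names are ours.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(p):
    # Jacobian of softmax w.r.t. its logits: diag(p) - p p^T.
    return np.diag(p) - np.outer(p, p)

# Near-uniform attention logits vs. one dominant (saturated) logit.
flat_p = softmax(np.array([0.1, 0.2, 0.0, -0.1]))
peaked_p = softmax(np.array([10.0, 0.2, 0.0, -0.1]))

flat_norm = np.linalg.norm(softmax_jacobian(flat_p))
peaked_norm = np.linalg.norm(softmax_jacobian(peaked_p))

# When softmax saturates, the Jacobian norm collapses, so little
# gradient flows back to the parameters producing the logits.
print(f"flat: {flat_norm:.4f}, peaked: {peaked_norm:.6f}")
```

Running this shows the saturated case has a Jacobian norm orders of magnitude smaller than the near-uniform case, which is the backpropagation bottleneck the abstract refers to.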