Initially introduced as a machine translation model, the Transformer architecture has become the foundation of modern deep learning, with applications in fields ranging from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever-larger training datasets and unsustainable amounts of compute. The ubiquity of the Transformer and of its core component, the attention mechanism, thus makes them prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representations of traditional scaled dot-product attention, leading to a symmetric dot-product attention with pairwise coefficients. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, reduces the number of trainable parameters by 6%, and halves the number of training steps required before convergence.
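To make the idea of a symmetric compatibility function concrete, the following is a minimal sketch in NumPy. It assumes one plausible parameterization: queries and keys share a single projection matrix `W` (which is where the parameter saving would come from), and a learned coefficient vector `d` supplies the pairwise weighting, giving scores `S = (XW) diag(d) (XW)ᵀ`, which are symmetric by construction. These modeling details, and all names in the code, are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_attention(x, W, d, Wv):
    """Self-attention with a symmetric compatibility function (sketch).

    Queries and keys share the projection W, so the raw score matrix
    S = (xW) diag(d) (xW)^T satisfies S == S.T; d holds the learned
    pairwise coefficients (an assumed parameterization).
    """
    h = x @ W                        # shared query/key projection
    scores = (h * d) @ h.T           # symmetric compatibility scores
    scores /= np.sqrt(W.shape[1])    # usual scaled dot-product scaling
    attn = softmax(scores, axis=-1)  # row-wise softmax (weights need not stay symmetric)
    return attn @ (x @ Wv)           # standard value aggregation

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n, d_model))
W = rng.normal(size=(d_model, d_head))   # shared in place of separate W_Q, W_K
d = rng.normal(size=(d_head,))           # pairwise coefficients
Wv = rng.normal(size=(d_model, d_head))  # value projection, unchanged

out = symmetric_attention(x, W, d, Wv)
S = (x @ W * d) @ (x @ W).T
assert np.allclose(S, S.T)  # compatibility scores are symmetric
print(out.shape)            # (4, 8)
```

Under this reading, replacing the separate `W_Q` and `W_K` matrices with one shared `W` plus a vector `d` removes nearly a full projection matrix per attention layer, which is consistent in spirit with the roughly 6% parameter reduction the abstract reports, though the exact accounting depends on the paper's actual construction.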