The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite accounting for a significant fraction of the model's parameters, it is highly redundant. Concretely, we substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder layers. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency relative to the original Transformer Big.
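The parameter arithmetic behind this claim can be sketched as follows. This is a minimal illustration, not the authors' code: the `FFNLayout` class and its field names are hypothetical, and the sizes (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers) are assumed values in the style of Transformer Big. For simplicity the sketch matches only the FFN parameter budget of the baseline, whereas the text above describes restoring the overall model size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FFNLayout:
    """Hypothetical description of how many distinct FFNs a model stores."""
    d_model: int                # model (embedding) dimension
    d_ff: int                   # FFN hidden dimension
    distinct_encoder_ffns: int  # FFN weight sets stored for the encoder
    distinct_decoder_ffns: int  # FFN weight sets stored for the decoder

    def params_per_ffn(self) -> int:
        # Two linear maps with biases: d_model -> d_ff -> d_model.
        return 2 * self.d_model * self.d_ff + self.d_ff + self.d_model

    def ffn_params(self) -> int:
        n_ffns = self.distinct_encoder_ffns + self.distinct_decoder_ffns
        return n_ffns * self.params_per_ffn()


# Assumed Transformer Big-like sizes (not taken from the text above).
D_MODEL, D_FF, N_ENC, N_DEC = 1024, 4096, 6, 6

# Baseline: every encoder and decoder layer owns its own FFN.
baseline = FFNLayout(D_MODEL, D_FF, N_ENC, N_DEC)

# Reduced model: one FFN shared by all encoder layers, none in the decoder.
shared = FFNLayout(D_MODEL, D_FF, 1, 0)

# Widest hidden dimension whose single shared FFN still fits within the
# baseline FFN budget: solve d_ff * (2 * d_model + 1) + d_model <= budget.
widened_d_ff = (baseline.ffn_params() - D_MODEL) // (2 * D_MODEL + 1)
widened = FFNLayout(D_MODEL, widened_d_ff, 1, 0)

if __name__ == "__main__":
    for name, layout in [("baseline", baseline),
                         ("shared", shared),
                         ("widened shared", widened)]:
        print(f"{name:>15}: d_ff={layout.d_ff:>6}, "
              f"FFN params={layout.ffn_params():,}")
```

Because a shared FFN is stored once but still applied in every encoder layer, sharing reduces the parameter count without changing the number of FFN applications in the encoder; dropping the decoder FFNs removes both their parameters and their computation.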