Less is More! A slim architecture for optimal language translation

The softmax attention mechanism has emerged as a noteworthy development in artificial intelligence research, building on the successes of Transformer-based architectures. However, the ever-increasing size of these architectures demands ever more memory, which limits their use. We propose KgV, a sigmoid gating mechanism that, in conjunction with softmax attention, significantly boosts performance without increasing architecture size. To reduce the size requirements, we leverage Tensor Chains to identify and prune excess parameters. We find that this excess resides primarily in the embedding layer rather than in the output linear layer. To further improve the embedding and significantly reduce its parameters, we introduce H-SoftPOS, a hierarchical embedding layer that simultaneously enhances performance. Remarkably, on the WMT14 English-German validation set, our approach yields a threefold reduction in perplexity, surpassing the current state of the art, while also reducing the parameter count by a factor of 3. Even when we reduce the number of parameters by up to sevenfold, we still achieve a 21% decrease in perplexity relative to the baseline Transformer. To assess generalization, we conduct experiments on the 7 language pairs of the WMT17 dataset. Our method outperforms existing techniques in terms of test loss while simultaneously halving the number of parameters. Moreover, we observe a 70-fold reduction in variance relative to the prior state of the art. In conclusion, our proposed method yields significant improvements in performance at a much lower memory cost. We call the resulting architecture Anthe.
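
To put the perplexity figures above in terms of loss, the following minimal sketch converts a multiplicative change in perplexity into the corresponding change in mean cross-entropy. It assumes the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood in nats, and it reads "a 21% decrease" as multiplying perplexity by 0.79; the abstract states neither convention explicitly. The function names and the baseline loss of 2.0 are hypothetical and chosen only for illustration.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity under the assumed definition: exp(mean per-token NLL in nats)."""
    return math.exp(mean_nll_nats)


def loss_drop_for_ppl_factor(factor: float) -> float:
    """Loss reduction (in nats) that divides perplexity by `factor`."""
    return math.log(factor)


# Hypothetical baseline loss, not a number taken from the paper.
baseline_loss = 2.0
baseline_ppl = perplexity(baseline_loss)

# A threefold perplexity reduction is a loss drop of ln(3) nats.
drop_threefold = loss_drop_for_ppl_factor(3.0)
ppl_threefold = perplexity(baseline_loss - drop_threefold)

# A 21% perplexity decrease (assumed to mean a factor of 0.79)
# is a loss drop of ln(1 / 0.79) nats.
drop_21_percent = loss_drop_for_ppl_factor(1.0 / (1.0 - 0.21))
ppl_21_percent = perplexity(baseline_loss - drop_21_percent)

print(f"threefold reduction: loss drop = {drop_threefold:.4f} nats")
print(f"21% reduction:       loss drop = {drop_21_percent:.4f} nats")
```

Under these assumptions the loss drop is independent of the baseline value: a threefold perplexity reduction always corresponds to about 1.10 nats, and a 21% reduction to about 0.24 nats.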
