The design of feed-forward networks (FFNs) in the Transformer incurs significant computational and parameter overhead. In this work, we emphasize the importance of the hidden dimension in designing lightweight FFNs, a factor often overlooked in previous architectures. Guided by this principle, we introduce PartialFormer, a parameter-efficient Transformer architecture that uses multiple smaller FFNs to reduce parameters and computation while preserving the essential hidden dimension. These smaller FFNs are integrated into the multi-head attention mechanism so that they can collaborate effectively. We also propose a tailored head scaling strategy to enhance PartialFormer's capabilities, and a residual-like attention calculation to improve depth scaling within PartialFormer. Extensive experiments on nine translation tasks and one abstractive summarization task validate the effectiveness of PartialFormer. Our code will be available at: \url{https://github.com/zhengkid/PartialFormer}.
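To make the parameter argument concrete, the following is a minimal sketch of the arithmetic behind replacing one wide FFN with several smaller ones. It is not the authors' implementation: the function names, the assumption that each small FFN acts on one head's slice of the model dimension, and all sizes (`D_MODEL`, `N_HEADS`, the hidden widths) are hypothetical values chosen only for illustration.

```python
def ffn_params(d_in: int, d_hidden: int) -> int:
    """Weights and biases of a two-layer FFN mapping d_in -> d_hidden -> d_in."""
    up = d_in * d_hidden + d_hidden      # first projection plus bias
    down = d_hidden * d_in + d_in        # second projection plus bias
    return up + down


def split_ffn_params(d_model: int, n_heads: int, d_hidden_per_head: int) -> int:
    """Total parameters of n_heads small FFNs, each acting on a d_model / n_heads slice.

    Assumption for illustration only: every head owns an equal slice of the
    model dimension and has its own independent two-layer FFN.
    """
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    return n_heads * ffn_params(d_head, d_hidden_per_head)


# Hypothetical sizes, not taken from the paper.
D_MODEL = 512
N_HEADS = 8

vanilla = ffn_params(D_MODEL, d_hidden=2048)
split = split_ffn_params(D_MODEL, N_HEADS, d_hidden_per_head=512)

print(f"single FFN, hidden width 2048:       {vanilla:,} parameters")
print(f"{N_HEADS} small FFNs, hidden width 512 each: {split:,} parameters")
print(f"summed hidden width of small FFNs:   {N_HEADS * 512}")
```

Under these assumed sizes, the small FFNs together expose a larger summed hidden width than the single FFN while using fewer parameters, because each weight matrix scales with the per-head input width rather than the full model width.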