In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to greater depth. Unlike prior work that shares all parameters or adds extra blocks, we design a more capable parameter-sharing architecture based on the matrix product operator (MPO). MPO decomposition reorganizes and factorizes the information in a parameter matrix into two parts: a central tensor that holds most of the information, and auxiliary tensors that account for only a small proportion of the parameters. Based on this decomposition, our architecture shares the central tensor across all layers to reduce the model size, while keeping layer-specific auxiliary tensors (together with adapters) to enhance adaptation flexibility. To improve model training, we further propose a stable initialization algorithm tailored to the MPO-based architecture. Extensive experiments demonstrate the effectiveness of the proposed model in reducing model size while achieving highly competitive performance.
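To make the decomposition concrete, the following is a minimal NumPy sketch of a three-tensor MPO factorization by sequential SVD, followed by a parameter count for sharing the central tensor across layers. It is an illustration under stated assumptions, not the implementation used in this work: the function names `mpo_decompose` and `mpo_reconstruct`, the choice of three local tensors, the factor shapes `(2, 16, 2)`, the absence of bond truncation, and the 12-layer count are all hypothetical, and adapters and the initialization algorithm are omitted.

```python
import numpy as np


def mpo_decompose(W, in_dims, out_dims):
    """Factor W (prod(in_dims) x prod(out_dims)) into three local tensors.

    Hypothetical helper: a plain sequential SVD without bond truncation.
    """
    (i1, i2, i3), (j1, j2, j3) = in_dims, out_dims
    # Pair each input factor with its output factor: (i1, j1, i2, j2, i3, j3).
    T = W.reshape(i1, i2, i3, j1, j2, j3).transpose(0, 3, 1, 4, 2, 5)

    # First split: peel off the left auxiliary tensor.
    U, S, Vh = np.linalg.svd(T.reshape(i1 * j1, -1), full_matrices=False)
    d1 = S.size
    left = U.reshape(1, i1, j1, d1)
    rest = (S[:, None] * Vh).reshape(d1 * i2 * j2, i3 * j3)

    # Second split: separate the central tensor from the right auxiliary tensor.
    U, S, Vh = np.linalg.svd(rest, full_matrices=False)
    d2 = S.size
    central = U.reshape(d1, i2, j2, d2)
    right = (S[:, None] * Vh).reshape(d2, i3, j3, 1)
    return left, central, right


def mpo_reconstruct(left, central, right):
    """Contract the three local tensors back into a 2-D matrix."""
    T = np.einsum("aijb,bklc,cmnd->ikmjln", left, central, right)
    rows = left.shape[1] * central.shape[1] * right.shape[1]
    cols = left.shape[2] * central.shape[2] * right.shape[2]
    return T.reshape(rows, cols)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # stand-in for one layer's weight matrix
left, central, right = mpo_decompose(W, (2, 16, 2), (2, 16, 2))
W_hat = mpo_reconstruct(left, central, right)

# Parameter count when one central tensor is shared by all layers and each
# layer keeps only its own pair of auxiliary tensors (assumed 12 layers).
num_layers = 12
aux_per_layer = left.size + right.size
shared_params = central.size + num_layers * aux_per_layer
unshared_params = num_layers * W.size
print(central.size, aux_per_layer, shared_params, unshared_params)
```

With these factor shapes the central tensor carries almost all of the parameters, so sharing it across layers leaves only the small auxiliary tensors to grow with depth. In practice the bond dimensions would typically be truncated, which this sketch does not do.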