Recent Transformer language models achieve outstanding results on many natural language processing (NLP) tasks. However, their enormous size often makes them impractical on memory-constrained devices, so practitioners must compress them into smaller networks. In this paper, we explore offline compression methods, meaning computationally cheap approaches that do not require further fine-tuning of the compressed model. We challenge the classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers, by compressing certain modules together, positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.