We introduce Krony-PT, a compression technique for GPT-2 based on Kronecker products. We specifically target the feed-forward weights of each transformer block and systematically compress the feed-forward layer matrices to varying degrees. We introduce a modified Van Loan decomposition to initialize the new Kronecker factors, and also propose a new pruning-based initialization technique. Our method compresses the original 124M-parameter GPT-2 into smaller models ranging from 80M to 96M parameters. Our 81M model variant outperforms DistilGPT2 on next-token prediction across all standard language modeling datasets, and performs competitively with significantly larger Kronecker-based compressions of GPT-2.
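For illustration, the following is a minimal sketch of the standard Van Loan (nearest Kronecker product) initialization that the modified decomposition builds on, written in PyTorch. The function name `van_loan_init` and the factor shapes are assumptions for this sketch; the paper's modified variant and pruning-based initialization are not reproduced here.

```python
import torch

def van_loan_init(W, m1, n1, m2, n2):
    """Nearest Kronecker product initialization (Van Loan / Pitsianis).

    Given a weight matrix W of shape (m1*m2, n1*n2), return factors
    A (m1 x n1) and B (m2 x n2) minimizing ||W - A kron B||_F, obtained
    from a rank-1 SVD of the rearranged matrix R(W).
    """
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so each (m2 x n2) block becomes one row of R(W),
    # giving R(W) of shape (m1*n1, m2*n2).
    R = (W.reshape(m1, m2, n1, n2)
          .permute(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    s = torch.sqrt(S[0])
    # The leading singular vectors, reshaped row-major, give the factors.
    A = (s * U[:, 0]).reshape(m1, n1)
    B = (s * Vh[0, :]).reshape(m2, n2)
    return A, B
```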