High-dimensional token embeddings underpin Large Language Models (LLMs), as they can capture subtle semantic information and significantly enhance the modelling of complex language patterns. However, this high dimensionality also introduces a considerable number of model parameters and prohibitively high storage and memory requirements, which are particularly unaffordable for low-end devices. Targeting scenarios with no extra training data and insufficient computational resources, we propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD), whereby each pre-trained token embedding is converted into a lower-dimensional Matrix Product State (MPS). We then comprehensively investigate the low-rank structures extracted by this approach, in terms of the compression ratio, language task performance, and latency on a typical low-end device (i.e., a Raspberry Pi). Taking GPT-family models (i.e., GPT-2 and CerebrasGPT) as case studies, our approach theoretically yields $46.89\%$ fewer parameters for the entire model, with a compression ratio of $39.38\times$ to $65.64\times$ for the embedding layers. With suitable hyperparameter choices, the model compressed with our approach achieves language task performance comparable to the original model at around $2.0\times$ embedding-layer compression. This empirically demonstrates the existence of low-rank structure in GPT-family models, and shows that about half of the parameters in the embedding layers are redundant.
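To make the embedding-to-MPS conversion concrete, the following is a minimal sketch of the standard TT-SVD procedure: a pre-trained embedding vector is reshaped into a higher-order tensor and factorized into MPS cores via sequential truncated SVDs. The tensor shape `(4, 4, 4, 12)` for a 768-dimensional embedding and the rank cap `max_rank` are illustrative assumptions, not the hyperparameters used in the paper.

```python
import numpy as np

def tt_decompose(vec, shape, max_rank):
    """TT-SVD: reshape a vector into a tensor of the given shape and
    factor it into MPS cores via sequential truncated SVDs."""
    cores = []
    t = vec.reshape(shape)
    r_prev = 1
    for k in range(len(shape) - 1):
        # Unfold: (previous rank * current mode) x (remaining modes)
        m = t.reshape(r_prev * shape[k], -1)
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        r = min(max_rank, len(s))  # truncate to the rank cap
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        # Carry the remainder into the next step
        t = (np.diag(s[:r]) @ vt[:r]).reshape(r, *shape[k + 1:])
        r_prev = r
    cores.append(t.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract MPS cores back into a flat vector."""
    out = cores[0]
    for c in cores[1:]:
        out = np.tensordot(out, c, axes=([-1], [0]))
    return out.reshape(-1)

rng = np.random.default_rng(0)
emb = rng.standard_normal(768)          # one token embedding (illustrative)
cores = tt_decompose(emb, (4, 4, 4, 12), max_rank=2)
n_params = sum(c.size for c in cores)   # far fewer than 768 at low rank
```

At full rank the reconstruction is exact; lowering `max_rank` trades reconstruction accuracy for a smaller parameter count, which is the compression knob the abstract refers to.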