In Large Language Models (LLMs), the number of parameters has grown exponentially in recent years, e.g., from 1.5 billion in GPT-2 to 175 billion in GPT-3, and possibly more than a trillion in later versions. This poses a significant challenge for deployment, especially on edge devices. Unlike cloud computing, edge devices have very limited memory and processing power, which necessitates novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that restricts values to powers of two (PoT). This saves a large amount of memory, since only exponents need to be stored; more importantly, it significantly reduces processing cost by replacing costly multiplications with low-cost bit shifts. To overcome the performance loss caused by this strict quantization, we apply Quantization-Aware Training (QAT) to recover accuracy through additional training. Results on GPT-2 (124M) show a major improvement for the quantized PoT model after additional training, with a 66% improvement in perplexity and only a 1% BERT-Score loss relative to the baseline GPT-2. The memory saving is estimated at 87.5%, while inference is expected to be 3-10x faster with PoT quantization than with full precision.
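To make the core idea concrete, the sketch below shows one simple way to round weights to signed powers of two and to replace a multiply by a bit shift. It is a minimal illustration in NumPy under assumed choices (nearest-power-of-two rounding, a clamped negative exponent range, and the hypothetical helper names pot_quantize and pot_matmul_row); the paper's actual quantizer, exponent coding, and QAT procedure may differ.

```python
import numpy as np

def pot_quantize(w, exp_bits=3):
    """Round weights to signed powers of two: w_q = sign(w) * 2^e.

    Only the sign and a small integer exponent e are stored, so a
    32-bit float shrinks to a few bits per weight (a 4-bit code would
    match the ~87.5% memory saving cited in the abstract).
    """
    sign = np.sign(w)
    eps = np.finfo(w.dtype).tiny                      # avoid log2(0)
    e = np.round(np.log2(np.maximum(np.abs(w), eps))).astype(np.int32)
    # Clamp to a representable negative range, e.g. [-8, -1] for 3 bits;
    # this assumes sub-unit weights and is an illustrative choice only.
    e = np.clip(e, -(2 ** exp_bits), -1)
    return sign, e

def pot_matmul_row(x_int, signs, exps):
    """Dot product of integer activations with PoT weights using shifts.

    Multiplying by 2^e with e < 0 becomes a right shift by -e, so the
    costly multiply is replaced by a cheap bit operation.
    """
    acc = 0
    for x, s, e in zip(x_int, signs, exps):
        acc += int(s) * (x >> (-int(e)))              # shift, not multiply
    return acc

# Example: quantize a small weight vector and apply it to integer inputs.
w = np.array([0.30, -0.07, 0.012], dtype=np.float32)
signs, exps = pot_quantize(w)
print(signs, exps)                    # e.g. [ 1. -1.  1.] [-2 -4 -6]
print(pot_matmul_row([128, 64, 32], signs, exps))
```

QAT would then fine-tune the model with this rounding applied in the forward pass (typically with a straight-through gradient estimator) so the network adapts to the restricted weight values.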